
Google refuses to spider site. It has been more than a year!

Google hits the index page and goes no further.


Crow_Song

5:34 pm on Sep 8, 2003 (gmt 0)

10+ Year Member



I am the web developer for the Faculty of Applied Science at a Canadian university. About a year ago, I redesigned one of our department's sites. It was not the smoothest transition: we moved it to a different server, I changed the structure of the directories and the names of pages, and we even changed the domain name and IP (it was using two domain names, and we dropped one). I expected Google to take a few months to re-index the site, but now a year later, I have a lot of angry profs blaming me for Google not listing their research.
I can't figure it out. I am at a loss. The code itself is not the problem... I am using a template that I also use on several other departments' sites. There is some server-side scripting, and there are includes, but nothing weird. Google has recently, finally, assigned a PR of 6 to the homepage, and a PR of 5 to several pages one level in. But it will not spider the site, nor will it assign a PageRank of more than 0 to any other pages on the site (at least 0 is an improvement! Until a few weeks ago, there was no rank at all).

I watch the logs every day: Google hits robots.txt, then the index page... and that's it. It has been doing that for months and months, never going to another page. The robots.txt file excludes only a testing directory, nothing more (I've even tried removing it altogether).

There are tons of links to the site, and I have painstakingly contacted webmasters from sites who listed the old url. They have updated their links, but it doesn't seem to matter.

Anyone have any ideas? I have tried submitting the site to DMOZ, but they have not listed it.

Thanks for any help. I am at my wit's end.

GoogleGuy

5:28 pm on Sep 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



kaled, that's a good suggestion. It may take a while (the webmaster pages are translated into 10 different languages, so it isn't trivial to update them), but I'll see what I can do.

Fearless

8:23 pm on Sep 9, 2003 (gmt 0)

10+ Year Member



A very simple all-HTML site map at root level has been working for me. Somewhat time-consuming to keep updated, but it works like a charm. The bots show up, the bots follow. If a page is on the map, it gets listed every time.
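A bare-bones sketch of the kind of all-HTML site map described - the page names here are made up for illustration:

```html
<!-- /sitemap.html: one plain text link per page, no scripts, no images -->
<h1>Site Map</h1>
<ul>
  <li><a href="/department/">Department</a></li>
  <li><a href="/research/">Research</a></li>
  <li><a href="/people/">People</a></li>
  <li><a href="/undergraduates/">Undergraduates</a></li>
</ul>
```

Every link is a clean href, so a spider that reaches this one page can reach everything listed on it.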

willybfriendly

8:35 pm on Sep 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A very simple all HTML site map at root level has been working for me. Sort of time consuming to keep updated but it works like a charm.

One word - xenu

WBF

Freshman

8:17 am on Sep 10, 2003 (gmt 0)



A very simple all HTML site map at root level has been working for me.

That's a nice idea, Fearless! Have you been waiting for the bot to pick it up, or did you just type it in at addurl.html?

xlcus

10:55 am on Sep 10, 2003 (gmt 0)

10+ Year Member



One word - xenu

xenu?

Quadrille

11:37 am on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



xenu?
Try "Xenu's Link Sleuth"

Crow_Song

12:48 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



I've got a site map, but it doesn't seem to matter. The Googlebot has never visited it. Just index page, robots page, then...gone.

twilight47

2:39 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



How about trying one static text link to the site map from your index page? This also assumes that the site map has static text links and not JS. Just a suggestion. :)

Crow_Song

2:54 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



Hi twilight47 - I do have a static text link to the site map from the home page (right at the bottom of the page). Googlebot has never followed the link, though.

swerve

3:32 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



I do have a static text link to the site map from the home page

IMO, that doesn't count as a "static text link". Remove the JavaScript (onmouseover) stuff from the link.

Use clean hrefs for all your links (no JS, mouseover, etc.) and these pages will all be indexed within 2 days.

Note that this is the same advice offered by GoogleGuy in message 24 (emphasis mine):

But my main advice is still to get a few more links (e.g. from within the university; campus directory, etc.), and instead of the javascript-y mouseovers, go with static links without fragments

I strongly suggest that you take this advice :-)

claus

4:43 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This thread's headline was very intriguing, so after reading the posts i just had to take a look at the page. Sorry about all the stupid comments that i will inevitably post now, but at least they will be operational, general good stuff, and free:

First things first: Crow_Song, welcome to WebmasterWorld :)

- try running the index page through this address: [validator.w3.org...]

You will need to choose an encoding and a document type first, as your HTML does not specify them; i tried 4.01 Transitional and iso-8859-1. There are quite a few error lines, but i've seen worse on pages that did well in the SERPs. Anyway, a cleanup might do you good, just for the sake of your flesh-and-blood visitors, if nothing else (this page will crash some browsers).

Then, try running it through this if you haven't got Lynx on your system (a "real lynx" would be better): [delorie.com...]

That's roughly what a spider will see on the page, just a bit nicer to look at (e.g. the centered text is not centered to a spider). You might consider that these links are repeated three times on the page, and the page has almost no other spiderable content:

Alumni, Graduates, Undergraduates, Department, Research, People

The site-map link is also repeated twice. Some spiders are intelligent bastards, and they tend to regard repetitions as an insult to their intelligence. Try having each link just once. Oh, and yes, the links in the "ilayer / layer" thingy do not show. Try a div - or better still, keep them in the open instead of playing hide-and-seek with DHTML; spiders are too serious for play.

Then, in the source code of your index page, count the number of characters... well, don't bother - i can tell you that you have 15,127 characters. 782 of these characters (5%) are visible to the spider - they are the ones shown at my link #2 (the Lynx viewer).

You really wouldn't want to use too many characters that the spider can't digest. Your page is simply too fat, with a page-fat index(*) of 95%.
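The arithmetic behind that figure is simple; a quick sketch, using the character counts quoted above:

```javascript
// Characters in the page source vs. characters a text-mode spider actually sees.
// The two figures are the ones quoted above for this index page.
const totalChars = 15127;  // full HTML source
const visibleChars = 782;  // text surviving in the Lynx view

const pageFat = 1 - visibleChars / totalChars;
console.log(`page-fat index: ${(pageFat * 100).toFixed(0)}%`); // → page-fat index: 95%
```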

It's all that javascript (why?) and graphics. A link like this one is no good for a spider:

<a href="/department/" onmouseover="changetext(content[4]);rollover('link4','on');parent.status='Department';return true" onmouseout="rollover('link4','off')"><img src="shared/bodylinks/link4_6off.gif" height="26" alt="Department" name="link4" border="0"></a>

For maximum efficiency, just cut down on the cholesterol and serve this instead - a regular text link (and just one):

<a href="/department/">Department</a>

Your page is 15K (pure text, without graphics); i'm sure you can get it down under 5K, keep all links in the open, and even add spiderable body text to the equation. Start by stripping the status-bar tricks - then make the "ilayer / layer" a div instead, turn those javascript-interchanging links into something the spider can understand and follow, and reduce to just one set of links.

In other words, it's a remake. That index page has a Tbar PR of 6 and there's a huge set of pages behind it too, so the only conceivable reason (imho) that Gbot wouldn't want to eat more of the cake you're offering is that it simply chokes on the very first page.

And yes. You can make a great/official/whatever looking design as well, but the code simply has to become more efficient.

There i go, giving it all away for free. Well, at least some. Just thought i'd share a bit of joy :)

/claus


(*) The term "page-fat index" is hereby published to the public domain according to the GNU license; give it away or earn money on it or even derivatives from it, but just don't say you coined it or that it's yours ;)

Chndru

5:52 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but just don't say you coined it or that it's yours ;)

wtg claus!

g1smd

7:21 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I just took the fat out of a 45K page with loads of repetitive HTML in it, randomly opening and closing font tags sometimes in mid-paragraph. It was a mess to look at the source code. Most of what was there was not needed. I cleaned the code, put section headings in <hx> tags (instead of <font size="6"> tags), and cut the page down to 33K total. In a month it went from #80 to #7 in the SERPs (in 5000 results).
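A sketch of the kind of swap described above - replacing presentational font tags with real heading elements (the heading text is hypothetical):

```html
<!-- Before: presentational markup, repeated at every heading -->
<font size="6">Research Areas</font>

<!-- After: a structural heading the spider can recognize -->
<h2>Research Areas</h2>
```

The heading element carries meaning the spider can use, and the styling moves to one place in a stylesheet instead of being repeated through the page.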

Crow_Song

8:06 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



Wow! Thanks for all of the great advice, everyone. I didn't realize that when people were referring to javascript-y stuff, that included the parent.status onmouseovers. Those always seemed so benign to me. That's an easy one to change.

As for the design changes that you suggest, claus... I'm going to have to find another way, unfortunately. I know that many webmasters would like to see really clean and pure code - even plain text and few graphics. But I've also had to carefully create pages that are exciting to my bosses, seem cutting-edge enough for students and profs, have good cross-platform and backwards compatibility for the profs here still running Netscape 3 (and there are more than you'd think!), still work in Lynx (I have tested it - I thought it was pretty good! ha ha!), etc.
:)
I don't want you to think I'm not considering your advice. It will certainly make me look back at the code more closely. But I can't redesign the page dramatically - it took months for every department to agree on a design that would be used faculty-wide.

The other thing I keep coming back to is, of course, the fact that the exact same template works just fine for every one of the departments and for the faculty site, but not for this one department. That's why I keep thinking it must be a hardware problem...a server configuration issue.

Anyway... I'll remove what little javascript I have left in the static links. Does Google really choke on that? All I really need is for Google to make it INSIDE... into the content pages, where I don't have any DHTML. No scripting. Just plain text and simple graphics. If I can point Googlebot to that stuff, I know it'll be happy.

Cheers,
Crow

claus

8:38 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's too late to edit my last post, but i just spotted this minor detail:

<a href="/department/">Department</a>

- relative links from one level to a subfolder of the same level (e.g. from the index page to a first-level folder like the one above) do not really need the slash in front of them, so personally i never do it like this, although i guess it's perfectly valid.

You could write it like this instead, and you would save the User-Agent the one step of locating your document root (which it is on already) before moving on to request the sub-folder:

<a href="department/">Department</a>

In general, you should always be a bit careful with up-stream indicators in relative links (like "/" and "./" and "../" and "../../"). Afaik, Gbot is quite good at following them but some of the other bots tend to get lost occasionally.
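How the different forms resolve can be sketched with the URL parser built into Node.js (the host and paths below are hypothetical):

```javascript
// Resolving the three common link forms against a page two levels deep.
const base = 'http://www.example.edu/department/people/index.html';

const plain = new URL('department/', base).href;   // relative to the current directory
const rooted = new URL('/department/', base).href; // anchored to the document root
const upOne = new URL('../', base).href;           // climbs one directory level

console.log(plain);  // http://www.example.edu/department/people/department/
console.log(rooted); // http://www.example.edu/department/
console.log(upOne);  // http://www.example.edu/department/
```

From a page at the root itself, the slashless and root-anchored forms land on the same folder; anywhere deeper in the tree they diverge, which is why the leading slash matters once pages get moved or included elsewhere.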

It's just a minor thing, but it saves you one extra character for each link on the index page after all. Keep the slash after the directory name, though, as otherwise your server will have to do a 302 redirect each time the directory is requested without it ;)

/claus

Crow_Song

8:52 pm on Sep 10, 2003 (gmt 0)

10+ Year Member



Thanks claus - I appreciate you taking the time to examine the code so thoroughly.

If the link you are referring to is in the links at the top or bottom (header or footer), the slash before the link enables me to use includes, so that every page's links will work no matter where someone might move them. I also try to stay away from ../ since it will break if anyone decides to relocate the file to a directory on a different level.

I'll have a look and see if there are superfluous slashes that I can remove.

Cheers,
Crow

claus

8:59 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> it must be a hardware problem...a server configuration issue.

duh... obvious! Why didn't i think of that? Try requesting the page using HTTP/1.0 instead of HTTP/1.1 and you'll get a 404. Then look through your logs for the entries Gbot makes. It's the same with appsci, though.

/claus


Edit: Something is wrong with my own setup of Sam Spade - i don't think this 404 is real as i keep getting funny errors with it right now...

[edited by: claus at 9:55 pm (utc) on Sep. 10, 2003]

g1smd

8:59 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In general, you should always be a bit careful with up-stream indicators in relative links (like "/" and "./" and "../" and "../../").

Actually, isn't a single / at the beginning of the path, and without any preceding dots, an absolute URL counted from the root, not a relative URL at all?

claus

9:16 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> relative / absolute?

Perhaps it's just a matter of wording, but i'd say it is relative, as it is relative to the root of the domain the link is on. An absolute link would include the domain name and the http:// - such a link would not be relative to any domain, as it would point to the same document space no matter which host/domain it was on.

/claus

Mohamed_E

9:28 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try requesting the page using HTTP/1.0 instead of HTTP/1.1 and you'll get a 404.

Fascinating observation, claus. What tool did you use to test the response to 1.1 vs 1.0? And what might cause such behavior? I do not recall reading anything along those lines previously.

Powdork

9:30 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I would do is keep your main links - the ones that call up the different information about the links - the same as they are now, since that does provide a function. You can achieve the hover effect on your footer with simple CSS by creating a hover style. Spiders have no problem with this, and you can position the code at the top (above the indexed fat ;)) while the footer remains where it is.
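A minimal sketch of that CSS-hover approach, with made-up class names and colors - no javascript involved, and the href stays a clean, spiderable link:

```html
<style type="text/css">
  a.footer { color: #003366; text-decoration: none; }
  a.footer:hover { color: #cc0000; text-decoration: underline; }
</style>

<a class="footer" href="/department/">Department</a>
```

The rollover effect lives entirely in the stylesheet, so the anchor tag itself is exactly the kind of plain link a spider follows without trouble.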

jim_w

9:30 pm on Sep 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I had a similar problem after a site redesign. I had a bunch of 301's and google stopped indexing. It would get the 301 and go away without getting the new page. I fell from the 2nd page to the bottom of the 3rd over a six-week period.

I changed all the 301's to 302's and googlebot started indexing again. It was the only change I made, and now I'm back on the 2nd page. If you have a bunch of 301's you might want to try changing them to 302's to see what happens.

claus

7:29 am on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jim_w:
The 301/302 issue is a different one - for some reason Gbot still has difficulties following these immediately, although if you read older posts (from the start of this year and back) it's definitely better now, as pages do not drop out entirely. It takes some time - around a month - before it's straightened out in the index, then the pages return to their normal position.

I have no idea why G has introduced this lag time (apart from a guess: to keep webmasters from using it as a site promotion tool), and it's clearly against the right way to treat a 301 redirect (which is: the page you have requested has moved permanently, quit using the old address and start using the new one now - it's the "now" part that Gbot does not follow).
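For anyone following along, the two redirects look almost identical on the wire - only the status line differs (the URL here is hypothetical):

```
HTTP/1.1 301 Moved Permanently
Location: http://www.example.edu/new-page/

HTTP/1.1 302 Found
Location: http://www.example.edu/new-page/
```

The 301 says the move is permanent - clients should switch to the new address from now on - while the 302 says it is temporary, so the old address remains the one to keep requesting.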

Mohamed_E:
SamSpade is your friend - get it and you will wonder what you did before. Something was perhaps wrong with my configuration; i got these 404s everywhere i tried HTTP 1.0, so this might not have been the real issue. I still haven't figured out what went wrong, but it's on my side i think, possibly an

Error 40
;)

Powdork:
You're absolutely right. And it is possible to make lean and standards compliant code that will not only look good but also work right back to the Lynx and mosaics ("work" meaning displaying all the content, even somewhat nicely, and not crashing the browser). I believe "lean" is the most important word in this respect.

/claus

Freshman

1:24 pm on Sep 11, 2003 (gmt 0)



If you have a bunch of 301's you might want to try to change them to 302's to see what happens.

Please somebody explain the 301/302 issue to me.
I'm just a newbie webmaster and it doesn't say in the glossary...

Fearless

3:32 pm on Sep 11, 2003 (gmt 0)

10+ Year Member



I've got a site map, but it doesn't seem to matter. The Googlebot has never visited it. Just index page, robots page, then...gone.

I was at this point for a month or so with a new non-profit information site. I tried several tactics. I got more links and I "looked up" internal pages using the Toolbar.

The bots eventually came around. Having spoken with other not-for-profit webmasters, it seems that Google has such sites on a slow "first full crawl" schedule - IF you don't have excellent PageRank to start with. (I did notice G-Guy's comment that you need more inbound links - that's significant.) It seems that there is some sort of formula in Google's newly patented algo where the higher your PageRank, the sooner and more frequently the bots hit. I guess that it's a better way to allocate spider resources. Makes sense, I suppose.

Since then, I've tried to seek out more inbound links and add content, and now the bots come around on a regular basis - usually once a week or more (if you count robots.txt visits). (And we're up to a toolbar rank of 4.)

GoogleGuy

3:32 pm on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to echo claus: if you have absolute links, you know that it's much harder for a spider to mess up. If you have relative links, there's always a chance that you didn't write them correctly, or the spider won't decipher them correctly.

Crow_Song

3:56 pm on Sep 11, 2003 (gmt 0)

10+ Year Member



Wow - thanks again for the fantastic advice, everyone. I wish I had become a part of this community months ago - it may have saved some of the walls around here from me bashing my head against them.

The links on the site are frequently tested and verified. I don't tend to use relative links - unless they're relative to the root, i.e. none of this: "../../../"

claus - the header info is very interesting. I'm going to explore this further with the sysadmin. We have found it very difficult to troubleshoot IIS (neither of us are Windows-types) and to determine if the problem is server configuration or my code.

Cheers,
Crow

wkitty42

11:25 pm on Sep 11, 2003 (gmt 0)

10+ Year Member



claus,

i don't think that the 404 is your sammy's problem... i get it here, too, when using http 1.0...

i would almost hazard a guess that their IIS 5.0 has been tuned (or something) to not handle http 1.0 requests... however, http 0.9 gets right thru <<scratching head>>

very very very weird...

[seconds pass]

ahha! found it...

http 0.9 and 1.1 send "GET /"

http 1.0 sends "GET /appsci.queensu.ca/"

definite difference there...

FWIW: i see the same thing when accessing my own Apache server... maybe it is a bug in sammy?

[time passes]

ahh fooey... the safebrowser at samspade.org doesn't appear to do anything other than http 1.1 ;(

plumsauce

8:49 am on Sep 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




re: HTTP/1.0

In IIS, if the site is sharing an IP address, and therefore IIS has been set up with host headers, *and* no host header is sent by the client, you will get a 404.

Most browsers, when using HTTP/1.0, will send the host header, but if not ....
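The difference can be checked by hand over telnet - the same HTTP/1.0 request without and with a Host header (hostname taken from the thread):

```
GET / HTTP/1.0

GET / HTTP/1.0
Host: appsci.queensu.ca
```

Against an IIS box that relies on host headers to pick the site, the first request gives the server nothing to match a binding on and draws the 404 described above; the second names the site and is served normally.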

++++

plumsauce

9:55 am on Sep 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



re:
http 1.0 sends "GET /appsci.queensu.ca/"

What piece of software is sending this?

If G is using the same library function in the spider, then this has implications for all of us. Especially if urlscan is in use. In the stock configuration, the two dots alone will cause the request to be summarily dropped on the floor.

The above request is clearly not HTTP/1.0 compliant. It might be HTTP/0.9 compliant, but G clearly identifies itself as HTTP/1.0. In that case, a server is not required to process the request in the format shown above. In the above case, G would have requested:

GET /appsci.queensu.ca/ HTTP/1.0

As a matter of fact, that exact request, made manually via telnet, brings up the standard IIS 404 page.

from rfc1945


The two options for Request-URI are dependent on the nature of the request.

The absoluteURI form is only allowed when the request is being made to a proxy. The proxy is requested to forward the request and return the response.

...

An example Request-Line would be:

GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.0

...

The most common form of Request-URI is that used to identify a resource on an *origin server* or gateway. In this case, only the absolute path of the URI is transmitted (see Section 3.2.1, abs_path).

For example, a client wishing to retrieve the resource above directly from the origin server would create a TCP connection to port 80 of the host "www.w3.org" and send the line:

GET /pub/WWW/TheProject.html HTTP/1.0

Notice that when talking to an origin server, as differentiated from a proxy, no hostname is permitted in the request line for HTTP/1.0.

Full shared-hosting support was introduced in HTTP/1.1 (RFC 2616) using the Host header. Some clients adopted the Host header even when identifying themselves as HTTP/1.0, e.g. Netscape 4.x.

The real question is: what is G sending, exactly? Anyone have server logs covering this?

++++

This 76-message thread spans 3 pages.