Forum Moderators: open
There are tons of links to the site, and I have painstakingly contacted the webmasters of sites that listed the old URL. They have updated their links, but it doesn't seem to matter.
Anyone have any ideas? I have tried submitting the site to DMOZ, but they have not listed it.
Thanks for any help. I am at my wit's end.
A very simple all HTML site map at root level has been working for me.
That's a nice idea, Fearless! Have you been waiting for the bot to pick it up, or did you just type it in at addurl.html?
I do have a static text link to the site map from the home page
Use clean hrefs for all your links (no JS, mouseover, etc.) and these pages will all be indexed within 2 days.
Note that this is the same advice offered by GoogleGuy in message 24 (my emphasis in bold added):
But my main advice is still to get a few more links (e.g. from within the university; campus directory, etc.), and instead of the javascript-y mouseovers, go with static links without fragments
First things first: Crow_Song, welcome to WebmasterWorld :)
- try running the index page through this address: [validator.w3.org...]
You will need to choose an encoding and a document type first, as your HTML does not specify them; I tried 4.01 Transitional and iso-8859-1. There are quite a few error lines, but I've seen worse on pages that did well in the SERPs. Anyway, a cleanup might do you good, just for the sake of your flesh-and-blood visitors, if nothing else (this page will crash some browsers).
Then, try running it through this if you haven't got Lynx on your system (a "real lynx" would be better): [delorie.com...]
That's sort of what a spider will see on the page; it's just a bit nicer to look at (e.g. the centered text is not centered to a spider). You might consider that these links are repeated three times on the page, and the page has almost no other spiderable content:
Alumni, Graduates, Undergraduates, Department, Research, People
The site-map link is also repeated twice. Some spiders are intelligent bastards, and they tend to regard repetitions as an insult to their intelligence. Try having each link just one time. Oh, and yes, the links in the "ilayer / layer" thingy do not show. Try a div - or better still, keep them in the open instead of playing hide-and-seek with DHTML; spiders are too serious for play.
Then, in the source code of your index page, count the number of characters... well, don't bother - I can tell you that you have 15,127 characters. 782 of these characters (5%) are visible to the spider; they are the ones shown at my link #2 (the Lynx viewer).
You really wouldn't want to use too many characters that the spider can't digest. Your page is simply too fat, with a page-fat index of 95%.
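As a rough illustration of that "page fat" measurement, here is a hedged sketch (not the tool claus used; the sample markup is a made-up stand-in, not the page in question) that compares spider-visible text to total source size using Python's standard html.parser:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects the text a spider can index, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data.strip())

def page_fat(html: str) -> float:
    """Return the fraction of the source that is NOT spider-visible text."""
    parser = VisibleText()
    parser.feed(html)
    visible = sum(len(c) for c in parser.chunks)
    return 1 - visible / len(html)

# A tiny made-up page: mostly script, very little indexable text.
sample = ('<html><head><script>var x = 1; // bloat</script></head>'
          '<body><a href="/department/">Department</a></body></html>')
print(round(page_fat(sample), 2))
```

Even this toy page is over 90% "fat" by that measure, which gives a feel for how a 15K page with only 782 visible characters looks to a spider.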
It's all that javascript (why?) and graphics. A link like this one is no good for a spider:
<a href="/department/" onmouseover="changetext(content[4]);rollover('link4','on');parent.status='Department';return true" onmouseout="rollover('link4','off')"><img src="shared/bodylinks/link4_6off.gif" height="26" alt="Department" name="link4" border="0"></a>
For maximum efficiency, just cut down on the cholesterol and serve this instead - a regular text link (and just one):
<a href="/department/">Department</a>
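To put a number on the difference, here is a quick sketch comparing the byte counts of the two links (the markup is copied from the posts above):

```python
# The javascript-laden link from the index page, verbatim.
fat_link = '''<a href="/department/" onmouseover="changetext(content[4]);rollover('link4','on');parent.status='Department';return true" onmouseout="rollover('link4','off')"><img src="shared/bodylinks/link4_6off.gif" height="26" alt="Department" name="link4" border="0"></a>'''

# The plain text link that carries the same information to a spider.
lean_link = '<a href="/department/">Department</a>'

print(len(fat_link), len(lean_link))  # the lean link is several times smaller
```

Multiply that saving by every link on the page (times three, given the repetition) and the 95% fat figure starts to make sense.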
Your page is 15K (pure text, without graphics); I'm sure you can get it down under 5K, keep all links in the open, and even add spiderable body text to the equation. Start by stripping the status-bar tricks, then make the "ilayer / layer" a div instead, make those javascript-interchanging links into something the spider can understand and follow, and reduce to just one set of links.
In other words, it's a remake. That index page has a Tbar PR of 6 and there's a huge set of pages behind it too, so the only conceivable reason (imho) that Gbot wouldn't want to eat more of the cake you're offering is that it simply chokes on the very first page.
And yes. You can make a great/official/whatever looking design as well, but the code simply has to become more efficient.
There I go, giving it all away for free. Well, at least some. Just thought I'd share a bit of joy :)
/claus
As for the design changes that you suggest, claus... I'm going to have to find another way, unfortunately. I know that many webmasters would like to see really clean and pure code - even plain text and few graphics. But I've also had to carefully create pages that are exciting to my bosses, seem cutting-edge enough for students and profs, have good cross-platform and backwards compatibility for the profs here still running Netscape 3 (and there are more than you'd think!), still work in Lynx (I have tested it - I thought it was pretty good! ha ha!), etc.
:)
I don't want you to think I'm not considering your advice. It will certainly make me look back at the code more closely. But I can't redesign the page dramatically - it took months for every department to agree on a design that would be used faculty-wide.
The other thing I keep coming back to is, of course, the fact that the exact same template works just fine for every one of the departments and for the faculty site, but not for this one department. That's why I keep thinking it must be a hardware problem...a server configuration issue.
Anyway... I'll remove what little javascript I have left in the static links. Does Google really choke on that? All I really need is for Google to make it INSIDE... into the content pages where I don't have any DHTML. No scripting. Just plain text and simple graphics. If I can point Googlebot to that stuff, I know it'll be happy.
Cheers,
Crow
<a href="/department/">Department</a>
- relative links from one level to a subfolder of the same level (e.g. from the index page to a first-level folder like the one above) do not really need the slash in front of them, so personally I never do it like this, although I guess it's perfectly valid.
You could write it like this instead, and you would save the User-Agent the one step of locating your document root (which it is on already) before moving on to request the sub-folder:
<a href="department/">Department</a>
In general, you should always be a bit careful with up-stream indicators in relative links (like "/" and "./" and "../" and "../../"). AFAIK, Gbot is quite good at following them, but some of the other bots tend to get lost occasionally.
It's just a minor thing, but it saves you one extra character for each link on the index page after all. Keep the slash after the directory name, though, as otherwise your server will have to do a 302 redirect each time the directory is requested without it ;)
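The two link styles resolve to the same place from the document root but not from deeper pages; a quick sketch with Python's standard urllib (the example.edu URLs are placeholders, not the site in this thread):

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
root_page = "http://www.example.edu/"
deep_page = "http://www.example.edu/people/staff/"

# From the document root the two styles resolve identically...
assert urljoin(root_page, "/department/") == urljoin(root_page, "department/")

# ...but from a deeper page they diverge, which is why root-relative
# links survive when a page is moved to another directory level.
print(urljoin(deep_page, "/department/"))  # http://www.example.edu/department/
print(urljoin(deep_page, "department/"))   # http://www.example.edu/people/staff/department/
```

This is the trade-off the next reply raises: the leading slash costs a character but makes the link safe to drop into an include used at any directory level.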
/claus
If the link you are referring to is in the links at the top or bottom (header or footer), the slash before the link enables me to use includes, so that every page's links will work no matter where someone might move them. I also try to stay away from ../ since it will break if anyone decides to relocate the file to a directory on a different level.
I'll have a look and see if there are superfluous slashes that I can remove.
Cheers,
Crow
duh... obvious! Why didn't I think of that? Try requesting the page using HTTP 1.0 instead of HTTP 1.1 and you'll get a 404. Then look through your logs for the entries Gbot makes. It's the same with appsci, though.
/claus
[edited by: claus at 9:55 pm (utc) on Sep. 10, 2003]
Perhaps it's just a matter of wording, but I'd say it was relative, as it is relative to the root of the same domain the link is on. An absolute link would include the domain name and the http://; such a link would not be relative to any domain, as it would point to the same document space no matter which host/domain it was on.
/claus
I changed all the 301's to 302's and Googlebot started indexing again. It was the only change I made and now I'm back on the 2nd page. If you have a bunch of 301's you might want to try to change them to 302's to see what happens.
I have no idea why G has introduced this lag time (apart from a guess: to keep webmasters from using it as a site promotion tool), and it's clearly against the right way to treat a 301 redirect (which is: the page you have requested has moved permanently, quit using the old address and start using the new one now - it's the "now" part that Gbot does not follow).
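For anyone unsure about the two codes being discussed: 301 and 302 differ only in the permanence they promise. A minimal sketch using Python's standard http module (the Location URL is a placeholder):

```python
from http import HTTPStatus

# 301: moved for good. The client should switch to the new URL from now on.
assert HTTPStatus.MOVED_PERMANENTLY == 301

# 302: found elsewhere for the moment. The client should keep using
# the old URL on future requests.
assert HTTPStatus.FOUND == 302

# Either way, the server sends the target in a Location header, e.g.:
#   HTTP/1.1 301 Moved Permanently
#   Location: http://www.example.edu/new-page/
print(HTTPStatus.MOVED_PERMANENTLY.phrase, "/", HTTPStatus.FOUND.phrase)
```

So a search engine that keeps revisiting the old URL after a 301 is, strictly speaking, ignoring what the status code tells it.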
Mohamed_E:
SamSpade is your friend; get it and you will wonder what you did before. Something was perhaps wrong with my configuration - I got these 404s anywhere I tried HTTP 1.0, so this might not have been the real issue. I still haven't figured out what went wrong, but it's on my side I think, possibly an Error 40 ;)
Powdork:
You're absolutely right. And it is possible to make lean and standards-compliant code that will not only look good but also work right back to Lynx and Mosaic ("work" meaning displaying all the content, even somewhat nicely, and not crashing the browser). I believe "lean" is the most important word in this respect.
/claus
If you have a bunch of 301's you might want to try to change them to 302's to see what happens.
Please somebody explain the 301/302 issue to me.
I'm just a newbie webmaster and it doesn't say in the glossary...
I've got a site map, but it doesn't seem to matter. The Googlebot has never visited it. Just index page, robots page, then...gone.
I was at this point for a month or so with a new non-profit information site. I tried several tactics. I got more links and I "looked up" internal pages using the Toolbar.
The bots eventually came around. Having spoken with other not-for-profit webmasters, it seems that Google has such sites on a slow "first full crawl" schedule - IF you don't have excellent PageRank to start with. (I did notice GoogleGuy's comment that you need more inbound links - that's significant.) It seems that there is some sort of formula in Google's newly patented algo where the higher your PageRank, the sooner and more frequently the bots hit. I guess it's a better way to allocate spider resources. Makes sense, I suppose.
Since then, I've tried to seek out more inbound links and add content, and now the bots come around on a regular basis - usually once a week or more (if you count robots.txt visits). (And we're up to a toolbar rank of 4.)
The links on the site are frequently tested and verified. I don't tend to use relative links - unless it's relative to the root, i.e. none of this: "../../../"
claus - the header info is very interesting. I'm going to explore this further with the sysadmin. We have found it very difficult to troubleshoot IIS (neither of us are Windows-types) and to determine if the problem is server configuration or my code.
Cheers,
Crow
I don't think that the 404 is your Sammy's problem... I get it here, too, when using HTTP 1.0...
I would almost hazard a guess that their IIS 5.0 has been tuned (or something) to not handle HTTP 1.0 requests... however, HTTP 0.9 gets right through <<scratching head>>
Very, very, very weird...
[seconds pass]
ahha! found it...
HTTP 0.9 and 1.1 send "GET /"
HTTP 1.0 sends "GET /appsci.queensu.ca/"
Definite difference there...
FWIW: I see the same thing when accessing my own Apache server... maybe it is a bug in Sammy?
[time passes]
Ahh, fooey... the safe browser at samspade.org doesn't appear to do anything other than HTTP 1.1 ;(
What piece of software is sending this? If G is using the same library function in the spider, then this has implications for all of us. Especially if urlscan is in use: in the stock configuration, the two dots alone will cause the request to be summarily dropped on the floor.
The above request is clearly not HTTP/1.0 compliant. It might be HTTP/0.9 compliant, but G clearly identifies itself as HTTP/1.0. In that case, a server is not required to process the request in the format shown above. In the above case, G would have requested:
GET /appsci.queensu.ca/ HTTP/1.0
As a matter of fact, that exact request made manually via telnet brings up the standard IIS 404 page.
From RFC 1945:
The two options for Request-URI are dependent on the nature of the request. The absoluteURI form is only allowed when the request is being made to a proxy. The proxy is requested to forward the request and return the response....
An example Request-Line would be:
GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.0...
The most common form of Request-URI is that used to identify a resource on an *origin server* or gateway. In this case, only the absolute path of the URI is transmitted (see Section 3.2.1, abs_path). For example, a client wishing to retrieve the resource above directly from the origin server would create a TCP connection to port 80 of the host "www.w3.org" and send the line:
GET /pub/WWW/TheProject.html HTTP/1.0
Notice that when talking to an origin server, as differentiated from a proxy, no hostname is permitted in the request line for HTTP/1.0.
Full shared hosting support was introduced in HTTP/1.1 (RFC 2616) using the Host header. Some clients adopted the Host header even when identifying themselves as HTTP/1.0, e.g. Netscape 4.x.
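The request-line rules in the RFC excerpts above can be written out explicitly. A hedged sketch (www.example.edu and the path are placeholders, not the site in this thread):

```python
# Request-line forms per RFC 1945 (HTTP/1.0) and RFC 2616 (HTTP/1.1).
host, path = "www.example.edu", "/pub/index.html"  # placeholders

# HTTP/0.9: bare method and path, no version token.
req_09 = f"GET {path}\r\n"

# HTTP/1.0 to an ORIGIN server: abs_path only, no hostname on the line.
req_10 = f"GET {path} HTTP/1.0\r\n\r\n"

# HTTP/1.0 to a PROXY: the absoluteURI form is used instead.
req_10_proxy = f"GET http://{host}{path} HTTP/1.0\r\n\r\n"

# HTTP/1.1: abs_path plus a mandatory Host header (enables shared hosting).
req_11 = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

# The malformed request seen above mixes the two 1.0 forms: a hostname
# inside abs_path, which an origin server reads as a literal (and
# nonexistent) top-level directory, hence the 404.
bad = f"GET /{host}{path} HTTP/1.0\r\n\r\n"
print(req_10, bad, sep="")
```

Comparing req_10 with bad makes the earlier log observation concrete: the hostname has no business inside the path when the request goes straight to the origin server.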
The real question is: what exactly is G sending? Anyone have server logs covering this?