Forum Moderators: Robert Charlton & goodroi
I recommend absolute links instead of relative links, because there's less chance for a spider (not just Google, but any spider) to get confused. In the same fashion, I would try to be consistent on your internal linking. Once you've picked a root page and decided on www vs. non-www, make sure that all your links follow the same convention and point to the root page that you picked. Also, I would use a 301 redirect or rewrite so that your root page doesn't appear twice. For example, if you select http://www.example.com/ as your root page, then if a spider tries to fetch http://example.com/ (without the www), your web server should do a permanent (301) redirect to your root page at http://www.example.com/
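For those on Apache, the usual way to implement that suggestion is a couple of mod_rewrite lines in a .htaccess file. This is only a sketch, assuming mod_rewrite is enabled on your server and with example.com standing in for your own domain:

```apache
# Redirect any request for example.com (no www) to www.example.com,
# preserving the requested path, with a permanent (301) redirect.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

If you picked the non-www version as your root, just swap the two hostnames around.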
I understand we need to decide on using the www or not. I'm going with www so I write http://www.example.com but what I'm not so sure about is when I refer to a page in the same directory. I usually write something like <A HREF="patterns.htm"> to link to that page in the widgets directory. Here is the part I'm not certain about. If I change the links to <A HREF="/patterns.htm"> with the slash have I made it absolute or do I have to write out the whole http://www.example.com/patterns.htm to be safe?
Also I'm on a shared server with a small company who uses apache servers. How do I write the 301?
[edited by: ciml at 10:56 am (utc) on June 13, 2005]
[edit reason] Examplified [/edit]
On the redirect I had to ask my server company to change it for me. Apparently there wasn't a way to do it myself. It would still be good to get some information on this thread on how to do it though, as I suspect I'm not the only one on this forum who doesn't know all the technical ins and outs.
Thanks to Helleborine there is some good information on how to discover if you have been hijacked at "302 Hijacks for Dummies". If you are concerned that you might have been hijacked just sticky me for the URL. I did find one hijack on my site.
If you have a search/replace function in whatever you use to make webpages, you should be able to replace what you have with the absolute links in not too long a time.
I haven't personally done this!
Front Page has a dialog to set this, but I think you'd have to go to every page, at least in my antiquated version.
Why would you want to use the BASE HREF metatag if all your links are absolutes? I'm confused about that.
If I made BASE HREF="/" for all my pages wherever they are what would my links look like? Or should I make my BASE HREF equal to the current folder of the page?
Does this apply to IMG SRC's too?
It's just one line of HTML you put in your HEAD section I believe, instead of changing hundreds of relative links to absolutes.
Since this documents where your page should have originated, it can help identify HiJacks. Think about it: each page itself does not document its own location or source without this.
As for non-www vs www.yoursite.com, I definitely prefer the with www. Why? Not really sure here.
Non-tech-savvy visitors have come to expect the www, and I feel better providing it.
GG gave what I think is really good advice (see quote above) and I follow it to the letter. -Larry
If I didn't know better I would think Google is asking us to do this as it's having problems identifying which sites certain pages should belong to - e.g. when several domains point to the same hosting account/website.
My examples are going to assume you know how to do "A" tag with "href=" stuff, so we don't mess with the forum rules. I'll give examples of what you can do:
"http://example.com/" or "http://www.example.com/" or "http://foo.example.com" or http://example.com/directory/file.htm" are all absolute URLs. -- they contain a "protocol" ("http://") and a domain name (e.g., "example.com") and possibly a file name. (If there is no file name, most servers know what file to deliver.)
"http://" is not the only possible protocol -- commerce servers use "https://"; "ftp://" and others are used for specialized purposes. But any URL that starts with a protocol is absolute.
If the protocol is missing, then the URL is relative. Its location is based on a BASE, which is normally the directory that the current page resides in.
"page2.htm" is contained in the same directory as the current page.
"subdir/page2.htm" is contained in a subdirectory, "subdir" or the directory containing the current page.
"../page2.htm" is contained in the PARENT directory of the directory containing the current page.
"/page2.htm" is contained in the root directory of your website -- which may be the same as the directory containing the current page, or it may be a parent of a parent of a parent of a parent of it.
The BASE meta tag merely says, "for purposes of relative URLs, don't use the directory this file is in -- use this other directory instead."
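In HTML that is a single BASE tag in the HEAD section. A minimal sketch, with example.com and the directory name as placeholders:

```html
<head>
  <!-- Every relative URL on this page now resolves against this base
       directory, not against the directory the page was fetched from. -->
  <base href="http://www.example.com/widgets/">
</head>
```

This applies to IMG SRC attributes as well as A HREF, since both are resolved the same way.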
I like relative URLs. Googleguy is right to not trust spiders, but ... any spider too stupid to handle relative URLs well won't survive on today's web anyway. And even google.com doesn't use exclusively absolute URLs.
The advantage of relative URLs is that you have a great deal of freedom to move website images around on your local machine or your server, or both, from one directory to another, and links to "neighbor" pages still work, without any fancy relocating tools (which, in my experience, are not sophisticated enough to handle, say, javascript-generated links -- one of my favorite techniques for some links that spiders don't need to follow, or aren't supposed to follow.)
And ... I've worked on too many very large programming projects, where careless absolute "include" directives made it very difficult to pick up someone else's project. Absolute file links are (in my experience) just flat evil, and I only have one brain to use for both C++ and HTML coding.
Note that absolute links are absolutely no protection from page-scrapers for non-javascript-generated links. It is a trivial matter for a perl program to check for a BASE directive, and modify it to point to the scraped mirror. Likewise, hard-coded absolute links can be easily detected and modified to make absolute links to the scraped pages. And just because the scrapers aren't doing this yet, doesn't mean they won't be doing it later this afternoon.
Javascripted links have disadvantages: people who are forced to use IE and are yet concerned about security will have to turn off Javascript, because Microsoft's version is so badly misdesigned for security. BUT: JAVASCRIPTED ABSOLUTE LINKS ARE ALMOST CERTAINLY SAFE FROM ALL LIKELY SCRAPER SCENARIOS.
Link spiders won't be your problem for unscripted links, whether relative or absolute; javascripted links are safe from link spiders and page scrapers, but can be damaged by many website mastering tools.
That's your options. Season to taste.
A couple of comments on your 2 questions:
do I have to write out the whole http://www.example.com/patterns.htm to be safe
steveb already answered this bit, but again, yes, this is absolute.
As an aside, I think good site management should use relative internal linking. However, I understand the need for absolute as it relates to the page jacking issue.
How do I write the 301?
There are a number of 301 examples in various threads here in WebmasterWorld. This is what I've gleaned from WebmasterWorld and am using on my sites:
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L]
A search for 301 or 301 non should give you some more ideas and if I've made any mistakes above, I'd welcome suggestions.
What I'm using is far more antiquated. Does anyone remember HTML notepad?
Thank goodness you're using HTML notepad and not Frontpage :) I understand you probably don't have the facility with this to do massive global replacement. I use a very old version of Homesite (predecessor of DW I think?) which does global replacements across multiple files and directories very nicely (with regex's if needed).
You may want to search around for another utility to do this for you. I don't know how big your site is, but I don't think I'd want to be doing this by hand. I expect you, like myself, aren't a "unix person"; there are probably a bunch of utilities to do it under unix as well if you can get someone to help you locally.
Hope that is of some assistance.
Regards,
Jim
anyone using relative linking should have a base meta tag on each page. This just confirms to the user-agent where it is now, so that it doesn't get lost. Kind of redundant, since the user-agent had to have the URL already just to get to the page, but safer than letting the user-agent assemble URLs out of thin air.
The base meta tag can also inject the www version of your domain into every request.
whatever method you choose, as googleguy suggested, should be consistent throughout the site.
some virtual hosts won't allow you to 301 redirect non-www to www. They use the non-www with a 302 redirect for tracking. This is where the base meta tag (with the canonical URL) can come in handy. Basically, if any user-agent arrives at the page with a non-canonical URL - i.e. the non-www version - the base meta tag will force it to use the canonical www URL.
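As a sketch of that trick, with example.com standing in for the real domain: the tag in each page's HEAD names the canonical www host, so every relative link on the page resolves against it no matter which hostname the visitor or bot arrived on:

```html
<head>
  <!-- Relative links below resolve against the canonical www host,
       even if this page was fetched via http://example.com/ -->
  <base href="http://www.example.com/">
</head>
```

It doesn't redirect the page itself the way a 301 would, but it stops a spider from assembling a whole parallel set of non-www URLs out of your relative links.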
Thanks for that. It has helped, but it wasn't the post I was recalling. Maybe it wasn't by GG, or maybe even it was on another forum, but I remember something about approx 4 or 5 point plan for making your site less vulnerable to 302 hijacks, which included all those points made by GG in that post.
I've done the 301 redirect from non-www to www. I was looking for a (that) check-list of what else to do.
GG didn't seem so perturbed about the absolute links issue and didn't seem to mention it having an effect on 302 hijacks. So just how does not having absolute links make you more vulnerable to hijacking?
If you have absolute hrefs and no 301 redirect, your site can still be split apart, but it requires more work on someone's part because they can't enlist the search engine's bot to shred the site. It will stop on the page it starts on. It can still be done, though - just a bit more work.
With proper 301s it can't be split in that manner, regardless of the type of hrefs used.
The intent of the splitting is to degrade your site by tripping various Google gotchas like massive duplicate content etc.
I do believe that someone has admitted that only a site that is degrading can have a page hijacked.
Of course, the proper initial setup of your server should only allow one valid server alias to be visible. No split is possible then.
Slightly different number of pages reported for site:www.aaaa.com versus site:aaaa.com, on Yahoo
I still have more pages properly indexed by Yahoo than Google, although before March Google had 100% of my pages indexed for many months, but now that stands at 65% indexed 35% URL only. Don't have a clue why. Many URL only pages are smaller, but some have lots of content, and still are URL only.
My host uses a linux server. What is the best way to implement the 301? Have people had problems after they did the 301 redirect?
Yep, that was exactly the thread I'd been looking for. Thanks. Not only does it have the message I'd been looking for, but I learnt some more from reading all the other messages in that thread. (I'd only just discovered WebmasterWorld the day I'd read that thread and hadn't realised the significance of it. I'd kinda been overwhelmed by the volume of information here... getting a bit more used to it now.)