Forum Moderators: phranque


Enter through the front door only

         

The_Kellys

11:42 am on Oct 10, 2010 (gmt 0)

10+ Year Member



Hello People,

First, if anyone does reply, can I ask you to keep it simple? I hate to admit it, but I find this very hard to understand.

What I would like to do is force all visitors, or at least most of them, to enter my site through the front door, that is, through the index.shtml file in the root of my site.

This is what I have done so far.

# Only through the front door
# Is it a request from me
RewriteCond %{HTTP_REFERER} !^http://mysite.co.uk/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://mysite.co.uk$ [NC]
RewriteCond %{HTTP_REFERER} !^http://mysite.co.uk/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://mysite.co.uk$ [NC]
# Is it through the front door
RewriteCond %{REQUEST_URI} !^\/.* [NC]
RewriteRule .* /index.shtml [R=301,L]

It's clearly wrong because it does not work every time. I arrived at the above by pinching code examples from others; I think the error is in the last-but-one line. I realise that users' bookmarks will get screwed up by this, but once the regulars get used to it I will be happy. I have just read another message and now realise I need to keep my cache clear when testing this. Done that; it still does not work properly.

Any help please (in as plain a language as possible)

sublime1

6:03 pm on Oct 10, 2010 (gmt 0)

10+ Year Member



Hi --

Keeping it simple, at your request:
  • The referer field is notoriously unreliable and should not be counted on
  • The first two pairs of RewriteCond's are redundant. Take away one pair.
  • Assuming you remove one of the redundant pairs, the remaining pair could be consolidated with the regular expression
    !^http://mysite\.co\.uk
    , which reads "not starting with
    http://mysite.co.uk
    ", which I think is the only test you need.
  • The regular expression in the last RewriteCond that is now
    !^\/.*
    should be
    !^/(index\.shtml)?$
    , meaning "unless the path is either just a slash, or a slash plus index.shtml". There is also no reason (and it may be harmful) to put a backslash before a forward slash as you have done here.
  • I recommend that the actual rewrite rule specify the correct fully qualified domain name as in
    RewriteRule .* http://mysite.co.uk/index.shtml [R=301,L]
    .
  • Further, none of this appears to handle the decision of whether you want your domain known as www.mysite.co.uk or mysite.co.uk (for more info, search on "domain canonicalization"). In any case, if it is possible to get to your site with or without a "www." in front (as it is in most cases), then you'll also need to test for that in your RewriteCond pattern. In this case, the pattern would change from
    !^http://mysite\.co\.uk
    to
    !^http://(www\.)?mysite\.co\.uk



I would, however, be remiss if I didn't ask: are you sure? Really, really sure?

Redirecting all external requests to your home page will nearly guarantee that your site never appears in search engines (a goal that is foreign, antithetical and downright sickening to almost all readers here :-). Is this what you want? (It's fine if so.)

Further, all usability studies suggest that doorway pages are an annoyance that people don't want, and are a sure way to drive people away -- while common in the early days of the web, you don't see them that often any more. That's either because sites don't want to be seen, or because they do, but have been ignored by search engines.

Finally, as I said earlier, the "referer" field is often not present in requests so should not be used to guarantee this outcome -- setting a session cookie with a short expiration is a more reliable method (although still not bulletproof): if the visitor doesn't have a current cookie, redirect them. Ask a separate question if you need help with how to do this, and test for it, etc.

If I have made you think twice about proceeding with this path, or if you feel you have a good reason to do it, I would encourage you to describe what it is that you're trying to accomplish, as I suspect there are better ways than the route you're on now.

My experience is that most people here are happy to help, but understanding why the problem is being solved is a key to finding the best solution.

Tom

The_Kellys

6:57 pm on Oct 10, 2010 (gmt 0)




Ah ha... I have digested and understood just about everything you say... brilliant!

I don't yet know how to do what you say about cookies, but will certainly see if I can implement it; if not, I will return.

About bots: after I wrote my message, I realised I had left out the bit about Google, Yahoo, bling or whatever it's called. I intend to put in statements (in English): AND user agent is NOT Google, and so on, so that they can go and do their thing.

My site, such as it is, is absolutely hammered by Google and Yahoo, and visited often by Microsoft, all without any advertising. I have no idea why.

Why I want to try this out is because of lots of links from other websites. Perhaps 75% of the visits to my site are direct to a page from a link; they grab what they want and then are gone. They can't be bothered to look at the rest. I ran a bulletin board for several years before the internet as we know it today, and that type of usage was considered very bad manners. My site has no commercial value at all and does not run any kind of advertising.

So even though it may drive a few of them away, I would like to give it a go. At the end of the day it's just a family website that managed to attract a seemingly never-ending trickle of one-time visitors. Most of them appear to come from links on other websites, and a few don't appear to have a referrer at all... I expect that they are the less desirable type of visitors.

Anyway, I like tinkering :)

Thank You for your help and your argument, very well put together.

Best Wishes, John

The_Kellys

7:15 pm on Oct 10, 2010 (gmt 0)




Update: well, I am most impressed. It seems I did understand you. It works!

This is how it ended up

# Only through the front door
# Is it through the front door
RewriteCond %{REQUEST_URI} !^/(index\.shtml)?$ [NC]
# Is it a request from me
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite\.co\.uk [NC]
# Is it a known Bot
RewriteCond %{HTTP_USER_AGENT} !^Google [NC]
RewriteCond %{HTTP_USER_AGENT} !^Slurp [NC]
# Redirect them if not via front door
RewriteRule .* http://mysite.co.uk/index.shtml [R=301,L]

I am going to leave it as is for the time being and watch the logs...I have to add a few more lines for the allowed bots.

Very pleased with your help.

Best Wishes, John

sublime1

7:28 pm on Oct 10, 2010 (gmt 0)




(Update; just read your reply that you have your plan working now. Here's one last attempt to dissuade you, even with the exceptions for Yahoo and Google)

John --

"Bling" -- good one :-)

The world of linking has changed. "Deep linking" used to be frowned upon as some kind of attempt to subvert the pageviews needed to support a site's advertising. Websites were personal. Bandwidth and server resources were expensive.

But the world is very different today. Banner ads and pageviews/impressions are less prevalent -- affiliate links and AdSense more so. The controversy about deep linking was over 10 years ago, by my count, and any decorum or manners probably long before that (at least that was my Usenet experience!).

More important, the number of links a site has is a key factor not only in how all search engines evaluate the quality of a page, but also in whether they "trust" a site. Real links are good if you want to get listed on search engines.

So maybe all you care about is real visitors who come for noble and purposeful reasons. As long as you don't make any money and never plan to, then your strategy is fine.

But if you like to tinker (as I do), spend some quality time using Google Analytics, and Google Webmaster tools. Cozy up to Webmaster World and read more. You should understand why people are coming to your site, reading a page and then leaving.

Perhaps you are an authority on a given topic and have given them everything they need to know in a single page. Or maybe they are just finding that your page or site holds no more interest for them.

I have a page on my personal blog that has turned into one of the authoritative resources on why Windows XP will not go into standby mode -- I started it, and my readers added hundreds of comments. I get a lot of traffic to that page, and an exceptionally high proportion (> 65%) of readers "bounce", meaning they read only that page. The blog is a personal thing: if I can get people what they want with this page, that works for me. We had a similar issue on a site whose goal is very much "make money" -- and there, our bounce rates started high and have been falling ever since.

Before you kill your site, learn why you're getting the visitors you're getting. There are always ways to change your site to make them do more of what you want, but you have to know what's happening first.

Tom

jdMorgan

4:12 pm on Oct 15, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The patterns for googlebot and slurp are incorrect, in that they are start-anchored. Have a look at your raw server logs to see the actual user-agent strings, then either specify those exact strings or remove the start-anchors.
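For example, since tokens like "Googlebot" and "Slurp" appear in the middle of the real user-agent strings rather than at the start, a sketch without the start-anchors might look like this (the exact tokens are assumptions -- verify them against your own logs first):

```apache
# Match the spider tokens anywhere in the user-agent string
# (no ^ anchor), rather than only at the very start:
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Slurp [NC]
```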

Also, many requests that appear to be from search engine spiders are actually from content scrapers. You can deny these accesses by checking the REMOTE_ADDR IP address against known spider IP addresses, or by checking the reverse DNS to make sure that the requesting IP address resolves to a known search engine hostname. This subject has been well covered here in the past, so try a search.

Be aware that users of AOL, EarthLink and other ISPs and corporations that use caching proxies in their networks to reduce external bandwidth demands will be unable to get to anything on your site except the home page. They will not be able to load any images, stylesheets, or external JavaScript files from your server.

One way to fix that is to allow blank referrers by changing the referrer exclusion to
 RewriteCond %{HTTP_REFERER} !^(http://(www\.)?mysite\.co\.uk.*)?$ [NC] 


Another way is to bypass the access-control rule if the request is not for a "page." But that may be counterproductive, since it will allow hotlinking to images, CSS, and JS files on your site.

The only really proper way to do what you're trying to do here is to have the home page set a cookie, and then check for that cookie on all non-home-page page requests. If the cookie is set, that means that the visitor has seen your home page and should be left alone. If it is not set, then they came in to a deep page directly, and you can redirect to the home page.

On Apache 1.x, you'll need a script to set cookies. On Apache 2.x, you can use mod_rewrite to do it. Either version of Apache can test cookies, though.
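To illustrate, here is a rough mod_rewrite sketch of the Apache 2.x cookie approach (the cookie name "seen_home" and the domain are placeholders -- adjust for your site; the bot exceptions are omitted for brevity):

```apache
# 1) When the home page is requested, set a session cookie:
RewriteRule ^(index\.shtml)?$ - [CO=seen_home:1:.mysite.co.uk,L]
# 2) Any other page request without that cookie goes to the home page.
#    Use 302, not 301, so browsers don't cache the redirect permanently:
RewriteCond %{REQUEST_URI} !^/(index\.shtml)?$
RewriteCond %{HTTP_COOKIE} !seen_home=1
RewriteRule . / [R=302,L]
```

This assumes .htaccess context, where the RewriteRule pattern sees the path without its leading slash.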

Also note that it's generally advised to redirect to example.com/ and not to example.com/index.xyz -- There is no reason to tie yourself to a technology-specific home-page URL, setting yourself a trap for the future when you may wish to go to .php or something else. Just redirect to "/" and let mod_dir take care of rewriting to your DirectoryIndex file.

Again, don't forget to add a domain canonicalization rule after this rule-set.
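Such a rule might look like this sketch, which forces the non-www form as an example (swap the hostnames if you prefer www; the empty-host test allows old HTTP/1.0 requests through):

```apache
# Redirect any host other than the canonical one (e.g. www.mysite.co.uk)
RewriteCond %{HTTP_HOST} !^mysite\.co\.uk$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule (.*) http://mysite.co.uk/$1 [R=301,L]
```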

Jim

The_Kellys

6:08 pm on Oct 15, 2010 (gmt 0)




I took so long answering this that your system appears to have thrown my reply away.

Anyway, thank you very much for your comments. One or two problems developed after a while, and I have suspended the code because I have little time to try and fix it. I fully expect your comments to be the solution. I will have to work through it slowly and try again. I'm a little slow myself nowadays, so don't expect an update for a while.

I am pretty convinced by your comments (and sublime1's) about cookies; the logic is inescapable. I will be back to get some help on how to write a cookie and check for it.

Thank you for your most valuable help with this.