rewrite and referer-data - Apache Web Server forum at WebmasterWorld - WebmasterWorld

rewrite and referer-data

Oliver Henniges

11:00 pm on Feb 22, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This has surely been asked often before, but I couldn't find a concise answer on this issue using the WebmasterWorld search function:

Recently I have begun to log the IP and referer of my visitors as an initial step of each session. As it turns out, a relatively high share (20%) of the logged "referers" come from www.mydomain, which shouldn't happen on a first request. The referer of the first request of a given session should be either empty, faked, or from a site other than mine.

Does a rewrite preserve or change the referer information sent by the browser? My .htaccess syntax is:

RewriteCond %{HTTP_HOST} ^mydomain\.de [nc]
RewriteRule (.*) [mydomain.de...] [R=301,L]
RewriteCond %{HTTP_HOST} ^myaliasdomain\.de [nc]
RewriteRule (.*) [mydomain.de...] [R=301,L]
RewriteCond %{HTTP_HOST} ^www.myaliasdomain\.de [nc]
RewriteRule (.*) [mydomain.de...] [R=301,L]

Meanwhile, EVERY single page on my site initially runs the following PHP code (among other steps):

if (!isset($_SESSION['PHPSESSID'])){
    // first request of this session: remember the session ID and log it
    $_SESSION['PHPSESSID'] = session_id();
    $sid = $_SESSION['PHPSESSID'];

    $sql = "replace into mylogs
            (sid,referer,ip...)
            values
            ('$sid',..,..,...)";
    mysql_query($sql);
}

Is there an easy way to modify .htaccess for my needs (and if not, how else can I preserve the referer data)? Or should I really worry about so many spoofed requests?

Any help is much appreciated.

gergoe

3:45 am on Feb 23, 2008 (gmt 0)

10+ Year Member

If a visitor goes to mydomain.de, for example, they get redirected to www.mydomain.de. What happens (transparently) is:

Browser requests mydomain.de, referer: www.example.com
Server responds with a redirect: go to www.mydomain.de
Browser requests www.mydomain.de; referer: mydomain.de
Server processes your php file, but you don't see the original referer anymore.

If you really want to preserve this information, you can write a small PHP file that saves it and then redirects the browser. So instead of making (external) redirects in mod_rewrite, make an (internal) rewrite to your referer-logging script, which updates the database and then issues a redirect (see [php.net ]).
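A minimal sketch of what such a logging script could decide, assuming PHP 7.1 or later. The function name plan_redirect() and the save_referer() call in the usage comment are hypothetical, not anything from the posts above; the idea is only to grab the referer while it is still the original one, then build the redirect target on the preferred host.

```php
<?php
// Hypothetical helper: decide whether this request needs the canonical-host
// redirect, and capture the referer before the redirect replaces it.
function plan_redirect(array $server, string $preferredHost): ?array
{
    $host = strtolower($server['HTTP_HOST'] ?? '');
    if ($host === '' || $host === strtolower($preferredHost)) {
        // Already on the preferred host, or no Host header at all: do nothing.
        return null;
    }
    return [
        'referer'  => $server['HTTP_REFERER'] ?? '',
        'location' => 'http://' . $preferredHost . ($server['REQUEST_URI'] ?? '/'),
    ];
}

// Usage inside the logging script (not executed here):
// $plan = plan_redirect($_SERVER, 'www.mydomain.de');
// if ($plan !== null) {
//     save_referer($plan['referer']);  // hypothetical: your own database insert
//     header('Location: ' . $plan['location'], true, 301);
//     exit;
// }
```

Keeping the decision in a function that takes the server array as a parameter makes it easy to try out with made-up requests before it touches live traffic.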

By the way, your current rules could be simplified by using the [OR] flag of RewriteCond, like this:

RewriteCond %{HTTP_HOST} ^mydomain\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^myaliasdomain\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.myaliasdomain\.de [NC]
RewriteRule (.*) http://www.mydomain.de/$1 [R=301,L]

Oliver Henniges

7:10 am on Feb 23, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thx for your enlightening insights (as always on webmasterworld).

I added

$redirect = '';
if (($myhost <> 'localhost') and ($myhost <> $mydomain)) {
    $redirect = 'http://'.$mydomain.$mynewpage;
    header("Location: {$redirect}");
}

to the end of my initialising program (after processing session variables and the database entry). At the beginning I defined

$mydomain as my preferred domain.
$mynewpage from the server-vars array as the actual page.
(I need the localhost check for my local Apache test environment, because I don't want my local tests to be redirected, of course.)

Can I now drop the .htaccess redirect entries?

We had a long discussion on this duplicate content issue. This is a massive change to ALL my webpages if anything goes wrong. Is there anything else I should take care of?

gergoe

11:13 am on Feb 23, 2008 (gmt 0)

10+ Year Member

I was thinking of doing the condition checking within Apache with mod_rewrite and only doing the database update and redirection in PHP, but this is as good as anything else. Only two problems are left.

You have to make sure you do not redirect when the referer is empty (not present), and that you set the proper redirect code when redirecting, using the status-line form of header():

header('HTTP/1.1 301 Moved Permanently');

Otherwise PHP sends 302 Found, which is a temporary redirect. But that was probably covered in the previous discussion?

Make sure you check the response headers of your application with some tool. Then you can always be sure how it will behave, and see whether there's a problem or not.

[edit]If you continue on this way (implementing everything in php), then you don't need the mentioned mod_rewrite directives indeed.[/edit]
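To make the 301 point concrete, here is a small sketch of the two header lines involved. The function name permanent_redirect_headers() is made up for illustration. PHP only falls back to 302 for a Location header when no 3xx status has been set, so sending the status line together with it keeps the redirect permanent.

```php
<?php
// Hypothetical helper: the header lines for a permanent redirect to $url.
function permanent_redirect_headers(string $url): array
{
    return [
        'HTTP/1.1 301 Moved Permanently', // explicit status, overrides the 302 default
        'Location: ' . $url,
    ];
}

// Usage in the page script (not executed here):
// foreach (permanent_redirect_headers('http://www.mydomain.de/page.php') as $line) {
//     header($line);
// }
// exit; // stop here so no page content follows the redirect
```

An equivalent shorter form is header('Location: ...', true, 301), which sets the status code through the third argument.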

Oliver Henniges

5:18 pm on Feb 23, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thank you very much for your assistance, gergoe, particularly for the 301 hint, because otherwise I would have left the bare Location header in place, which would have sent a 302 by default.

However

> do not redirect when the referer is empty

seems inappropriate to me, because particularly googlebot comes along with a zero referer entry and thus would not be redirected, and robots (+duplicate content indexing) are the main reason why all this redirection has been implemented.

> Make sure you check the response headers of your application with some tool

How can I do that? I'm a noob in some way and not too familiar with those older tools like telnet;)

jdMorgan

9:18 pm on Feb 23, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

For header checking, I suggest the "Live HTTP Headers" add-on for Firefox/Mozilla browsers.

One more simplification and a correction to the mod_rewrite, although I'm not sure you're still using it:

RewriteCond %{HTTP_HOST} ^example\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^(www\.)?myaliasdomain\.de [NC]
RewriteRule (.*) http://www.example.de/$1 [R=301,L]

Jim
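If a browser add-on is not at hand, the same check can be sketched in PHP itself. This is only an illustration: summarize_response() is a made-up name, and the usage comment assumes PHP 7.1 or later, where get_headers() accepts a stream context so that the redirect is reported rather than followed.

```php
<?php
// Hypothetical helper: pull the status code and Location value out of the
// raw header lines of the first response.
function summarize_response(array $headerLines): array
{
    $summary = ['status' => null, 'location' => null];
    foreach ($headerLines as $line) {
        if ($summary['status'] === null && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $summary['status'] = (int) $m[1];
        } elseif ($summary['location'] === null && stripos($line, 'Location:') === 0) {
            $summary['location'] = trim(substr($line, strlen('Location:')));
        }
    }
    return $summary;
}

// Usage against a live URL (not executed here):
// $context = stream_context_create(['http' => ['method' => 'HEAD', 'follow_location' => 0]]);
// $lines = get_headers('http://myaliasdomain.de/', false, $context);
// if ($lines !== false) {
//     print_r(summarize_response($lines));
// }
```

For the rules above, the interesting part is whether the status comes back as 301 rather than 302 and whether Location points at the preferred host.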

Oliver Henniges

6:38 am on Feb 24, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> although I'm not sure you're still using it:

;)

> One more simplification..

I have always admired people who have mastered regex syntax. I have tried again and again, and when I see a pattern I roughly understand what it is about, but whenever I try to write a line on my own I get complete rubbish. Seems I'm getting old, I'm afraid...

gergoe

2:27 pm on Feb 24, 2008 (gmt 0)

10+ Year Member

> However
> > do not redirect when the referer is empty
> seems inappropriate to me, because particularly googlebot comes along with a zero referer entry and thus would not be redirected, and robots (+duplicate content indexing) are the main reason why all this redirection has been implemented.

Sorry, it's a typo actually, I meant the host header. So to be on the safe side, if the HTTP_HOST is empty, skip the redirection - although it's unlikely that it will ever happen.

Achernar

5:46 pm on Feb 24, 2008 (gmt 0)

10+ Year Member

Top Contributors Of The Month

Note that the referer value is preserved by http redirects (301, 302).
Your problem might be with users not accepting cookies, and a new session being started on every php page.

Questions:
Do you have a common referer string/syntax when the domain of the referer is your site?
Is it pointing to one particular page? Is this page a dynamic page, and does it start the session?

Oliver Henniges

1:44 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> Note that the referer value is preserved by http redirects (301, 302).

thx for the info.

> Your problem might be with users not accepting cookies

You may be right. What I actually noticed after implementing that log script was that a high number of requests from one specific IP had my own [www-domain...] as the referer. I assume that this was a robot not accepting cookies. What, precisely, happens if a browser or script that does not accept cookies requests one of my pages? A friend once told me that in such cases the session ID is automatically added as a GET variable to the URL string, but I never verified this.

>Do you have a common referer string/syntax when the domain of the referer is your site?
>Is it pointing to one particular page? Is this page a dynamic page, and does it start the session?

As I said, I only had the above lines in my .htaccess file. I have now dropped them and perform the appropriate redirect in my PHP script. All my HTML pages run through the PHP parser and have an include_once ('this-tracking-script.php'); line at the beginning. I'm quite sure from the logic of my script that this redirect is independent of my session management and of whether any cookie is accepted or not.

Achernar

4:07 pm on Feb 25, 2008 (gmt 0)

10+ Year Member

Top Contributors Of The Month

> from one specific IP had my own [www-domain...] as the referer

With a path after the domain name? Or only the domain name (with or without an ending "/") ?

Oliver Henniges

8:23 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

A pity, I can't tell. I have since deleted the entries from the testing period, because I want clean data after March 1st (:

I think it was the mere domain without any path. It was with www but I cannot tell about the ending "/".

jdMorgan

9:56 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> A friend once told me, that in such cases the session ID is automatically added as a get variable to the url-string, but I never verified this.

That is alarming, if it's true; you should never give sessions to search engine spiders as "GET" parameters, unless you want massive duplicate-content problems. If the SE spider sees a different URL+query every time it fetches a page, then you may expect that page to be indexed under dozens or even hundreds of URLs, and to suffer the resulting dilution of PageRank/link-popularity.

Or, if you sufficiently annoy or confuse the spider, an actual "duplicate-content penalty" (as opposed to the many imaginary ones we hear about).

Your site must not require spiders to accept cookies, and it must allow them to crawl without session data in the URL.

Jim

gergoe

12:40 am on Feb 26, 2008 (gmt 0)

10+ Year Member

Concerning the preservation of the referer information: the behavior is not defined in the HTTP specification(s), so you cannot expect it to work in all cases. Some browsers (or versions) might indeed preserve it (the clever ones), while others might follow the specifications very strictly and use the source URL of the redirection. The only way to know which one applies is to check it on most browsers (or to search for this information on the web).
Re-reading your original post, I noticed you mention www_mydomain_de as the referer (which escaped my attention; I don't know why I thought it was one of the "aliases"). In that case your problem is indeed related to sessions/cookies, as Achernar pointed out.

Concerning the session ID: in some PHP releases this behavior was indeed enabled by default (switching to passing the session ID in the query string when cookies were not accepted), but it has defaulted to disabled for a long while, so you will not see this behavior. You may want to check the session.use_only_cookies, session.use_trans_sid and url_rewriter.tags PHP ini settings for more information on this subject. Besides, enabling this "fallback method" is not advised; it has security drawbacks in addition to the SEO ones.
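A quick way to look at those three settings on your own server is sketched below. The function name session_url_settings() is made up; ini_get() itself is standard PHP and returns false for a setting the installation does not know.

```php
<?php
// Hypothetical helper: read the ini settings that control whether PHP may
// pass the session ID in URLs instead of a cookie.
function session_url_settings(): array
{
    $names = ['session.use_only_cookies', 'session.use_trans_sid', 'url_rewriter.tags'];
    $values = [];
    foreach ($names as $name) {
        $values[$name] = ini_get($name); // string value, or false if unknown
    }
    return $values;
}

// Usage (not executed here):
// foreach (session_url_settings() as $name => $value) {
//     printf("%s = %s\n", $name, var_export($value, true));
// }
```

If session.use_trans_sid reports "0" and session.use_only_cookies reports "1", PHP will not append the session ID to links for visitors without cookies.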

I think the best bet for logging referer data is not based on cookies and sessions, but simple comparison, if the referer is not one of your domain names, then save it, otherwise discard it - unless you want to do some in-depth analysis of your traffic, see where your visitors come from, which pages they checked in which order, and from which page they left your website.
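The comparison could look roughly like this. It is a sketch under assumptions: is_external_referer() is a made-up name and the host list is just the example domains from this thread.

```php
<?php
// Hypothetical helper: true when the referer points to a host that is not
// one of our own domain names, i.e. when it is worth saving.
function is_external_referer(string $referer, array $ownHosts): bool
{
    if ($referer === '') {
        return false; // nothing to log
    }
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // malformed referer or no host part
    }
    return !in_array(strtolower($host), array_map('strtolower', $ownHosts), true);
}

// Usage (not executed here):
// $own = ['www.mydomain.de', 'mydomain.de', 'www.myaliasdomain.de', 'myaliasdomain.de'];
// if (is_external_referer($_SERVER['HTTP_REFERER'] ?? '', $own)) {
//     // save the referer to the log table
// }
```

This works without sessions or cookies, so visitors who refuse cookies no longer produce an own-domain entry on every page view.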

Oliver Henniges

12:15 pm on Feb 26, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

thx for the additional info to both of you.

> ...If the SE spider sees a different URL+query every time it fetches a page, then you may expect that page to be indexed under dozens or even hundreds of URLs, and to suffer the resulting dilution of PageRank/link-popularity.
> Or, if you sufficiently annoy or confuse the spider, an actual "duplicate-content penalty" (as opposed to the many imaginary ones we hear about).

The more I think about all this, the more confused I get (see also my question in this thread [webmasterworld.com]). For instance, take this URL:

[webmasterworld.com...]

What prevents a competitor from spreading hundreds of such nonsense backlinks with varying GET parameters all over the web, thus leading Google and other spiders to try to parse and index them?

Back to topic:

Those mysterious entries in my database persist, but I found out that I have to learn a number of basics about session management and cookies first, before continuing this thread with questions that have been answered again and again in other threads. Maybe I'll come back in a few days. For instance, I ran a few tests with my own browser with cookies disabled, and bingo, there they are.

Is there any way at all to run a shop system without client-side scripting for visitors who, for whatever reason, have disabled cookies?

Again, thank you very much for your assistance.

Oliver Henniges

7:06 pm on Feb 28, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

OK. This morning I programmed a database-driven-workaround for all visitors with cookies disabled. It seems to work fine, they all receive only one session-ID now, and I did NOT use the GET-method due to the security risks mentioned.

I also made a final check with my new system directly typing in my not-preferred-domain in the browser bar, with no effect.

But I still got three entries from today's afternoon showing mypreferreddomain as the referrer.

It's not due to the redirect.
It's not due to corrupt session management.

All spoofed?
A hint that I have been hijacked? (The number of visitors hasn't grown the way it had since June last year, but Google shows an adequate number of indexed pages when searching for site:www.mydomain.)
Any other ideas?