Forum Moderators: phranque
Recently I began logging the IP and referer of my visitors as the initial step of each session. Surprisingly, a relatively high share (20%) of the logged "referers" come from www.mydomain, which shouldn't happen: the referer of the first request of a given session should be empty, faked, or from a site other than mine.
Does mod_rewrite preserve or change the referer information sent by the browser? My .htaccess syntax is:
RewriteCond %{HTTP_HOST} ^mydomain\.de [NC]
RewriteRule (.*) http://www.mydomain.de/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^myaliasdomain\.de [NC]
RewriteRule (.*) http://www.mydomain.de/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.myaliasdomain\.de [NC]
RewriteRule (.*) http://www.mydomain.de/$1 [R=301,L]
Meanwhile, EVERY single page on my site initially runs the following PHP code (among other steps):
if (!isset($_SESSION['PHPSESSID'])) {
    $_SESSION['PHPSESSID'] = session_id();
    $sid = $_SESSION['PHPSESSID'];
    $sql = "REPLACE INTO mylogs
            (sid, referer, ip, ...)
            VALUES
            ('$sid', ..., ..., ...)";
    mysql_query($sql);
}
Is there an easy way to modify .htaccess for my needs (and if not, how else can I preserve the referer data)? Or should I really worry about so many spoofed requests?
Any help is well appreciated.
If you really want to preserve this information, you can write a small PHP file that saves it and then redirects the browser. In other words, instead of making (external) redirects in mod_rewrite, make an (internal) rewrite to your referer-logging script, which updates the database and then issues the redirect (see [php.net ]).
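A minimal sketch of such a logging-then-redirect script, reusing the `mylogs` table from the post above. The connection credentials, column list, and target host are placeholders, not values from this thread:

```php
<?php
// log_redirect.php - sketch only: record the referer, then redirect externally.
// Credentials and the "mylogs" column list are placeholders (assumptions).
session_start();

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$ip      = $_SERVER['REMOTE_ADDR'];
$sid     = session_id();

$link = mysql_connect('localhost', 'dbuser', 'dbpass');
mysql_select_db('mydb', $link);
$sql = sprintf("REPLACE INTO mylogs (sid, referer, ip) VALUES ('%s', '%s', '%s')",
    mysql_real_escape_string($sid, $link),
    mysql_real_escape_string($referer, $link),
    mysql_real_escape_string($ip, $link));
mysql_query($sql, $link);

// Now issue the external redirect with an explicit 301 status;
// header('Location: ...') alone would send a 302.
header('HTTP/1.1 301 Moved Permanently');
header('Location: http://www.mydomain.de' . $_SERVER['REQUEST_URI']);
exit;
```

The internal rewrite in .htaccess would then point the non-canonical hosts at this script instead of redirecting directly.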
By the way, your current rules can be simplified using the [OR] flag of RewriteCond, like this:
RewriteCond %{HTTP_HOST} ^mydomain\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^myaliasdomain\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.myaliasdomain\.de [NC]
RewriteRule (.*) http://www.mydomain.de/$1 [R=301,L]
I added
$redirect = '';
if (($myhost != 'localhost') and ($myhost != $mydomain)) {
    $redirect = 'http://' . $mydomain . $mynewpage;
    header("Location: {$redirect}");
}
to the end of my initialising program (after processing the session variables and the database entry). At the beginning I defined:
$mydomain as my preferred domain, and
$mynewpage (taken from the $_SERVER array) as the current page.
(I need the localhost check for my local Apache test environment, because I don't want my local tests to be redirected, of course.)
Can I now drop the .htaccess redirect entries?
We had a long discussion about this duplicate-content issue. This change affects ALL my web pages if anything goes wrong. Is there anything else I should take care of?
You have to make sure you do not redirect when the referer is empty (not present), and that you set the proper redirect code when redirecting, using the
header('HTTP/1.1 301 Moved Permanently'); form of header(); otherwise PHP sends 302 Found, which is a temporary redirect - but that was probably covered in the previous discussion? Make sure you check the response headers of your application with some tool; then you can always be sure how it behaves, and see whether there's a problem or not.
[edit]If you continue this way (implementing everything in PHP), then you indeed don't need the mod_rewrite directives mentioned above.[/edit]
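Putting those two points together, a hedged sketch of the canonical-host redirect with an explicit 301 and a guard on the Host header; the variable names follow the earlier post, and the preferred host is a placeholder:

```php
<?php
// Sketch only: redirect to the preferred host with an explicit 301.
// $mydomain and $mynewpage follow the naming of the earlier post (assumptions).
$mydomain  = 'www.mydomain.de';   // preferred host (placeholder)
$myhost    = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : '';
$mynewpage = $_SERVER['REQUEST_URI'];

// Skip the redirect when the Host header is empty, when testing on
// localhost, or when we are already on the preferred host.
if ($myhost != '' && $myhost != 'localhost' && $myhost != $mydomain) {
    header('HTTP/1.1 301 Moved Permanently'); // without this, Location: sends 302 Found
    header('Location: http://' . $mydomain . $mynewpage);
    exit;
}
```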
However
> do not redirect when the referer is empty
seems inappropriate to me, because googlebot in particular comes along with an empty referer and thus would not be redirected, and robots (plus duplicate-content indexing) are the main reason why all this redirection was implemented.
> Make sure you check the response headers of your application with some tool
How can I do that? I'm a bit of a noob and not too familiar with older tools like telnet ;)
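You don't need telnet; PHP itself can show the response headers. A small sketch using the built-in get_headers() function (the URL is a placeholder for your own domain):

```php
<?php
// Print the response headers for a URL. get_headers() follows redirects by
// default, so you will see the whole chain including any 301/302 lines.
$headers = get_headers('http://mydomain.de/');  // placeholder URL
foreach ($headers as $line) {
    echo $line, "\n";   // e.g. "HTTP/1.1 301 Moved Permanently", "Location: ..."
}
```

Browser extensions that show response headers would do the same job interactively.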
One more simplification and a correction to the mod_rewrite, although I'm not sure you're still using it:
RewriteCond %{HTTP_HOST} ^example\.de [NC,OR]
RewriteCond %{HTTP_HOST} ^(www\.)?myaliasdomain\.de [NC]
RewriteRule (.*) http://www.example.de/$1 [R=301,L]
;)
> One more simplification..
I have always admired people who have mastered the regex syntax. I have tried again and again, and when I see a notation I roughly understand what it is about, but whenever I try to write a line of my own I get complete rubbish. Seems I'm getting old, I'm afraid...
However,
> do not redirect when the referer is empty
seems inappropriate to me, because googlebot in particular comes along with an empty referer and thus would not be redirected, and robots (plus duplicate-content indexing) are the main reason why all this redirection was implemented.
Sorry, that was actually a typo; I meant the Host header. So to be on the safe side, skip the redirect if HTTP_HOST is empty - although it's unlikely that will ever happen.
Questions:
Do you have a common referer string/syntax when the domain of the referer is your site?
Is it pointing to one particular page? Is this page a dynamic page, and does it start the session?
thx for the info.
> Your problem might be with users not accepting cookies
You may be right. What I actually noticed after implementing that log script was that a high number of requests from one specific IP had my own [www-domain...] as the referer. I assume this was a robot not accepting cookies. What, precisely, happens when a browser or script that does not accept cookies requests one of my pages? A friend once told me that in such cases the session ID is automatically appended to the URL as a GET variable, but I never verified this.
>Do you have a common referer string/syntax when the domain of the referer is your site?
>Is it pointing to one particular page? Is this page a dynamic page, and does it start the session?
As I said, I only had the above lines in my .htaccess file. I have now removed them and perform the appropriate redirect in my PHP script. All my HTML pages run through the PHP parser and have an include_once('this-tracking-script.php'); line at the beginning. From the logic of my script I'm quite sure this redirect is independent of my session management and of whether any cookie is accepted or not.
from one specific IP had my own [www-domain...] as the referer
That is alarming, if it's true. You should never give session IDs to search-engine spiders as GET parameters, unless you want massive duplicate-content problems. If the SE spider sees a different URL+query every time it fetches a page, then you may expect that page to be indexed under dozens or even hundreds of URLs, and to suffer the resulting dilution of PageRank/link-popularity.
Or, if you sufficiently annoy or confuse the spider, an actual "duplicate-content penalty" (as opposed to the many imaginary ones we hear about).
Your site must not require spiders to accept cookies, and it must allow them to crawl without session data in the URL.
Jim
Concerning the session ID: in some PHP releases this behavior was indeed enabled by default (PHP switched to passing the session ID in the query string when cookies were not accepted), but for a long time now it has defaulted to off, so you will not see this behavior. Check the session.use_only_cookies, session.use_trans_sid and url_rewriter.tags PHP ini settings for more information on this subject. Besides, enabling this "fall-back method" is indeed not advised; it has security drawbacks on top of the SEO ones.
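To make sure the session ID is never propagated in the URL regardless of the server defaults, those settings can be forced before the session starts - a sketch only; the same values can instead go into php.ini or a per-directory configuration:

```php
<?php
// Disable the trans-sid fall-back so session IDs never end up in URLs.
ini_set('session.use_only_cookies', '1');  // accept the session ID from cookies only
ini_set('session.use_trans_sid',    '0');  // never rewrite URLs to carry PHPSESSID
ini_set('url_rewriter.tags',        '');   // no HTML tags get the ID appended
session_start();
```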
I think the best bet for logging referer data is not based on cookies and sessions, but on a simple comparison: if the referer is not one of your own domain names, save it; otherwise discard it. The exception is if you want to do an in-depth analysis of your traffic - see where your visitors come from, which pages they viewed in which order, and from which page they left your website.
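That comparison can be sketched as a small helper; the list of own host names below is an assumption based on the domains mentioned in this thread:

```php
<?php
// Return true when a referer is "external", i.e. worth logging.
// $own_hosts lists every host name belonging to your own site (assumption).
function is_external_referer($referer, $own_hosts) {
    if ($referer == '') {
        return false;               // no referer sent: nothing to log
    }
    $host = parse_url($referer, PHP_URL_HOST);
    if ($host === false || $host === null) {
        return false;               // unparsable referer: discard
    }
    return !in_array(strtolower($host), $own_hosts);
}

$own_hosts = array('mydomain.de', 'www.mydomain.de',
                   'myaliasdomain.de', 'www.myaliasdomain.de');

var_dump(is_external_referer('http://www.google.de/search?q=x', $own_hosts)); // external
var_dump(is_external_referer('http://www.mydomain.de/page.php', $own_hosts)); // own site
```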
> ...If the SE spider sees a different URL+query every time it fetches a page, then you may expect that page to be indexed under dozens or even hundreds of URLs, and to suffer the resulting dilution of PageRank/link-popularity. Or, if you sufficiently annoy or confuse the spider, an actual "duplicate-content penalty" (as opposed to the many imaginary ones we hear about).
The more I think about all this, the more confused I get (see also my question in this thread [webmasterworld.com]). For instance, take this URL:
[webmasterworld.com...]
What prevents a competitor from spreading hundreds of such nonsense backlinks with varying GET parameters all over the web, leading Google and other spiders to try to parse and index them?
back to topic:
Those mysterious entries in my database persist, but I've realised I need to learn some basics of session management and cookies first, before continuing this thread with questions that have been answered again and again in other threads. Maybe I'll come back in a few days. For instance, I ran a few tests with my own browser with cookies disabled, and bingo - there they are.
Is there any way at all to run a shop system, without client-side scripting, for visitors who - for whatever reason - have disabled cookies?
Again, thank you very much for your assistance.
I also made a final check of my new system by typing my non-preferred domain directly into the browser bar; that produced no such entry.
But I still got three entries from today's afternoon showing mypreferreddomain as the referrer.
It's not due to the redirect.
It's not due to corrupt session management.
All spoofed?
A hint that I have been hijacked? (The number of visitors hasn't grown the way it did since June last year, but Google shows an adequate number of pages indexed when searching for site:www.mydomain.)
Any other ideas?