This is my first post to the forum. I just found it the other day and have enjoyed reading posts for a week or two. I hope this post is in the right forum group.
I am in the process of changing my site to require the acceptance of cookies from visiting browsers.
At my site, Perl scripts create the HTML page served up by combining HTML content and HTML structure together. In order to determine whether a visiting browser has cookies enabled, I have had to insert a call to a cookie-setting script that then redirects to my other HTML-building scripts. The cookie-setting script sets a cookie. The HTML-building scripts check whether the cookie was accepted before serving up the usual combined HTML page.
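Roughly, the flow looks like this (a simplified sketch, not my actual code; my scripts are in Perl and the script names here are made up, but the idea is the same):
[pre]
<?php
// cookie_test.php (made-up name): set a test cookie, then bounce back
// to this same script to see whether the browser returned it.
if (!isset($_GET['checked'])) {
    setcookie("cookie_test", "1");                      // try to set a cookie
    header("Location: /cookie_test.php?checked=1");     // second pass
    exit;
}

if (isset($_COOKIE['cookie_test'])) {
    header("Location: /build_page.php");        // cookie came back: build the page as usual
} else {
    header("Location: /need_cookies.html");     // cookie refused: ask the visitor to enable cookies
}
exit;
?>
[/pre]
In my real setup the second script is one of the HTML-building scripts rather than a static page, but the cookie check itself is the same.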
Anyway it's working well (though I have not yet uploaded all the changes to my actual site) except for one thing...
My site will no longer allow visits from browsers or user agents that do not accept cookies.
But I DO want to allow visits from spiders, search engine robots, and other such user agents.
Will spiders and robots accept cookies and return them?
If not, will my strategy mean that all such spiders stop spidering my site content once my site rejects them and keeps redirecting them to a page asking them to accept cookies, a page to which they obviously will not respond?
Will I have to include a pass-through list of spiders in my cookie-setting code so as to let them in? Is there an easier way? If not, what is the most effective strategy for letting the most valuable spiders in?
Any insight on any or all of the above questions would be very appreciated. What I am really looking for is just some direction and not so much step by step instructions on how to do things.
Thanks.
Carlos
I was just thinking about these very issues last night.
I understand everything that you are doing, but I just don't understand the motivation.
Is there a specific reason you want 100% cookied? It strikes me that making an exception for spiders would instantly undermine this approach.
Here's something I did in PHP; maybe it will give you an idea of what to do in Perl.
[pre]
<?php
/* Use this to start a session only if the UA is *not* a search engine,
   to avoid duplicate content issues with URL propagation of SIDs */
$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Atomz", "Scooter");
$is_search_engine = 0;

foreach ($searchengines as $val) {
    if (strstr($HTTP_USER_AGENT, $val)) {
        $is_search_engine++;
    }
}

if ($is_search_engine == 0) { // Not a search engine
    /* You can put anything in here that needs to be
       hidden from search engines */
    session_start();
} else { // Is a search engine
    /* Put anything you want only for search engines in here */
    $foo = $bar;
}
?>
[/pre]
Nick
But jatar_k's point is good. Why enforce the use of cookies for just accessing the pages? Opening up your pages for spiders will allow anybody who wants to surf the site without cookies to do so by changing the User-Agent string.
Don't force people to use cookies but provide some additional benefits when people actually use cookies. Then it's up to them to decide whether the additional benefits make up for the lack of privacy.
Let me address the reason for cookies....
The problem I was having was trying to come up with some idea as to how many new visitors I was actually getting from my Apache web logs. "Hits" just didn't cut it for reasons that everyone here probably knows about.
I needed some way to "tag" a visitor so that if they went off to other sections of my site and looked around they would not be counted as a new visitor for every page they saw.
I also decided to do my own page-access logging, recording stats on whether visitors had cookies enabled, Javascript turned on, and other such things, as a way to help me use such technologies to better enhance my site in the future.
When I came around to recording the HTTP_REFERER field I realized that many times this field was empty. Again I needed a way to record what page a visitor was on and what paths they took around my site, so that I could better target my offerings or revise my pages to improve the overall site.
So I once again decided to use the cookie as a way to create my own referer field when a visitor is visiting areas of my site.
All in all cookies just seemed like the best alternative.
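To give an idea of what I mean (again just a sketch; my actual scripts are Perl and the file and variable names here are made up), the tagging and logging part in PHP terms would be roughly:
[pre]
<?php
// Hypothetical snippet included at the top of each page script.
session_name("visitor_id");       // non-persistent cookie that tags the visitor
session_start();

$current = $_SERVER['REQUEST_URI'];
$last    = isset($_SESSION['last_page']) ? $_SESSION['last_page'] : '-';   // my own "referer"

$line = sprintf("%s\t%s\t%s\t%s\n",
    gmdate("Y-m-d H:i:s"),
    session_id(),                 // same ID for the whole visit
    $current,
    $last);

$fp = fopen("/path/to/visit.log", "a");   // hypothetical log location
fwrite($fp, $line);
fclose($fp);

$_SESSION['last_page'] = $current;        // remember for the next page view
?>
[/pre]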
If anyone is interested, I uploaded a page I have been working on explaining more about why I am using cookies <edit>URL removed</edit> Please bear in mind that my site is NOT operational and that this page has not yet been incorporated into my site or made public. It's just for those of you here who might want more info on why I am using cookies. I am still working out my cookie strategy and am open to further input on why I should or shouldn't.
Today as I was thinking about things it became clear that checking for spiders and letting only some in was a losing proposition for my site, in terms of the time it would take to add valuable spiders to my list, make sure the good ones get in, and keep the list up to date, not to mention the slowdown in serving each page while checking through an ever-growing list.
It also became clear after doing some research on the User Agent field that this value is not really that reliable and that many spiders even use names like "Mozilla" in this field. I had thought that I could limit my cookie-requiring code to just those User Agents that turned out to be the major browsers, identifying them by words like Mozilla or MSIE or some such, but that didn't seem like a good approach either.
I think I have settled on just setting a cookie, checking to see if the cookie was accepted, logging the results for future reference, and letting everyone in regardless of cookies being enabled or not.
Since upwards of 90% of people browse with cookie acceptance enabled, I figure my use of cookies will still be about 90% or so accurate. If so, this will give me more than enough stats to determine which pages I should revise or drop, or which advertising avenues prove to be the most profitable in terms of new visitors.
For password-protected areas of my site I will definitely require 100% cookies, since I won't want search engines to spider those areas anyway and since cookies seem like the best way to create a session ID of sorts, allowing people who have successfully logged in to visit the various private content.
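For the protected pages the check itself is simple enough (again just a sketch; the login script and the session flag are made-up names):
[pre]
<?php
// Hypothetical check at the top of a members-only page.
session_start();

if (empty($_SESSION['logged_in'])) {
    // No valid session came back (no cookie, or never logged in):
    // send the visitor to the login form instead.
    header("Location: /login.php");
    exit;
}

// ...serve the private content here...
?>
[/pre]
The hypothetical login script would set that flag after a successful password check.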
Most of my cookies will be non-persistent by the way.
I very much like having control over how a new visitor is determined and counted and cookies seem to give me the ability to record things that I want and in the way I want.
Only thing I haven't figured out yet is how to work with proxy caches to still count new visitors when they cache my site pages :). For now I am declaring most of my pages uncacheable until I can build up some stats to give me some handle on how to improve my site.
Anyway I hope you all don't mind the long reply but in case anyone was interested I thought I would express things a bit more fully.
Carlos
[edited by: carlos123 at 12:09 am (utc) on Sep. 7, 2002]
[edited by: NFFC at 1:01 pm (utc) on Sep. 8, 2002]
[edit reason] URL removed [/edit]
You could have a login-only section protected by htaccess that is not accessible unless signed in, and have all of the public, spiderable information outside of that.
You could design your own server-side tracking with Perl or PHP.
You could get a better stats package to create better reports from your logs.
I don't know the best answer for you but I think, at the moment, you should investigate a little more before you jump into it. There are a lot of options.
Maybe I will indeed rethink my strategy some more. I thought of using session IDs in the URL but have tended to avoid these because it seems that search engine spiders have trouble with them (unless I feed non-session-ID URLs to spiders and session URLs to everyone else).
Thanks again for the input.
Carlos
PS. By the way the URL I posted was indeed accessible but people were apparently copying the ending period along with the URL part - which made it inaccessible. I deleted the period.
Putting [url] tags around the URL will prevent the system from treating the trailing period as part of the URL when a sentence ends with a URL.
What's a webbug?
A webbug is a tiny little insect living in fibre glass cables and eating tcp packets.
Seriously, it's a little transparent 1x1 graphic that gets referenced from any (possibly cached) HTML page and that must not be cached by any proxy server. You can achieve that with the following little PHP script:
[pre]
<?php
// Serve an uncacheable 1x1 transparent GIF.
header("Content-Type: image/gif");
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");  /* date in the past */
$now = gmdate("D, d M Y H:i:s");
header("Last-Modified: $now GMT");                 /* always modified */
header("Cache-Control: no-store, no-cache");       /* HTTP/1.1 */
header("Cache-Control: must-revalidate, post-check=0, pre-check=0", false);
header("Pragma: no-cache");                        /* HTTP/1.0 */

function hex2bin($s) {
    $bin = "";
    for ($i = 0; $i < strlen($s); $i += 2) {
        $bin .= chr(hexdec(substr($s, $i, 2)));
    }
    return $bin;
}

print hex2bin('47494638396101000100800000ffffff000000'
            . '21f90401000000002c00000000010001000002024401003b');
?>
[/pre]
[edited by: mark_roach at 6:31 pm (utc) on Sep. 7, 2002]
[edit reason] split long string to prevent scroll problem [/edit]
I have indeed been thinking of putting in a webbug now that I know what you were referring to :). I understand your code too (that's scary! Am I becoming a geek? :)) though I had never seen anyone create an image on the fly like that. Very interesting.
Only thing I don't understand is the relationship between your PHP code and the cached HTML page.
I am assuming that the file containing your PHP code is something you reference in an HTML img tag, instead of the usual true image that such tags reference.
Now that I think about it, that's a very interesting possibility. Name a Perl or PHP script with an image name, make it uncacheable, and have the cached page reference back to it. When it is accessed, record all the usual HTTP_USER_AGENT, REMOTE_ADDR, and other variables.
Does anyone know if that would work? It's a little off from the original topic of this thread, but I'm just curious. If it works it would certainly allow me to gain the benefits of having my pages cached while still giving me a more or less accurate insight into how many of my pages are being seen.
Of course one would have to tell their Apache server to run any files ending in .gif or .jpg or other such extension as CGI scripts, as long as such files were placed in something like the cgi-bin where no true image files would be.
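Building on your script above, the logging half might look something like this (untested sketch; the log path and fields are just examples):
[pre]
<?php
// Logging half of the "image" script; the GIF-serving half is
// exactly as in the script above.
$line = sprintf("%s\t%s\t%s\t%s\n",
    gmdate("D, d M Y H:i:s"),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
    isset($_SERVER['HTTP_REFERER'])    ? $_SERVER['HTTP_REFERER']    : '-');

$fp = fopen("/path/to/webbug.log", "a");   // hypothetical log file
fwrite($fp, $line);
fclose($fp);

// ...then send the no-cache headers and print the 1x1 GIF as above.
?>
[/pre]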
Hmmmm....very interesting possibilities here....if it's workable.
Thanks.
Carlos
Put <img src="spacer.png" height="1" width="1" alt=""/> in your HTML code and use mod_rewrite to rewrite spacer.png to counter.php. That should hide it from most folks.
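For example, something like this in an .htaccess file (assuming mod_rewrite is available; the file names are just placeholders matching the example above):
[pre]
RewriteEngine On
# Serve the PHP counter script whenever the "image" is requested
RewriteRule ^spacer\.png$ counter.php [L]
[/pre]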
Of course the referer of that webbug will be your cached page, not the one people came from. Otherwise you got the idea. It does work and it's used out there.
An absolutely AWESOME technique is all I can say. Your input has helped me put the pieces together. An ah hah! type of moment.
This forum is something else. I just got through reading the original Stanford paper outlining the inner workings of Google at
[www-db.stanford.edu ]
I found the link by looking up Brett (I can't remember his last name) from Search Engine World in the search engines and then after reading his incredibly informative articles searching for "analysis of Google ranking".
It all started by links or info I found on this forum.
What a great resource this is!
The Stanford article gave me some very valuable insights that supported much of what Brett was saying in his articles, about the value of anchor text, the size of pages and its relationship to whether Google likes to index them, and all kinds of things. I'm actually surprised it's still up on the web. I would have thought Google would want it taken down.
It's an awesome read for anyone who is interested, though if I did not have an intense interest in the subject matter it would be a bit too academic to swallow.
I will definitely be changing things at my site. Not only will I be adding a webbug as you mentioned, Andreas, but I will also be making some of my large pages much smaller.
Thanks again for everyone's valuable input.
Carlos