
preventing refresh loops

this must be a solved problem

         

Sutin

6:06 pm on Sep 18, 2025 (gmt 0)

10+ Year Member



I want to do the following (php pseudocode):
if( cookie )
    load page content
else
    set cookie
    refresh page


This ensures that a cookie exists before any content is returned (many crawlers are too dumb to handle cookies).

Question 1: Documentation says, "Keep in mind that the header function must be called before any output is sent to the client browser." Since setting a cookie and refresh are both header modifications, I can actually set the cookie and request a page refresh in any order, right?

Question 2: If the browser is set to not accept cookies, then how do I prevent an infinite loop?

markRg

9:09 pm on Sep 18, 2025 (gmt 0)

Top Contributors Of The Month



I would do it like this:

When a user visits the site, we set a cookie a=1 and always check it.

And then:

if ( isset($_COOKIE['b']) && isset($_COOKIE['a']) ) {
    echo "content";
} elseif ( !isset($_COOKIE['b']) && isset($_COOKIE['a']) ) {
    // refresh page
} else {
    // do nothing
}

Sutin

9:52 pm on Sep 18, 2025 (gmt 0)

10+ Year Member



I am missing where 'b' gets set, so it isn't clear why there are two cookies.

My current thought is that if I display a "Hi! I am checking if you are human..." and refresh after, say, 5 seconds, then I don't care so much if there is an infinite loop when cookies are blocked.
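
If the loop does need to be broken, one option is to tag the refreshed URL with a marker parameter and give up when the marker arrives without the cookie. A minimal sketch, assuming PHP 7.3 or later; the cookie name 'seen' and the parameter 'ck' are made up for illustration. On the ordering question: setcookie() and header() both only queue response headers, so either can come first as long as neither is called after output has started.

```php
<?php
// Sketch only. The 'seen' cookie and the 'ck' marker parameter are made-up names.

/** Decide how to answer one request: 'content', 'retry', or 'blocked'. */
function gate_decision(array $cookies, array $query): string
{
    if (isset($cookies['seen'])) {
        return 'content';   // the cookie came back, serve the page
    }
    if (isset($query['ck'])) {
        return 'blocked';   // we already retried once and there is still no cookie
    }
    return 'retry';         // first visit: set the cookie and refresh
}

/** Append the marker parameter to a request URI. */
function with_marker(string $uri): string
{
    return $uri . (strpos($uri, '?') === false ? '?' : '&') . 'ck=1';
}

// Skipped on the command line so the functions above can be exercised in isolation.
if (PHP_SAPI !== 'cli') {
    $decision = gate_decision($_COOKIE, $_GET);
    if ($decision === 'retry') {
        setcookie('seen', '1', ['path' => '/']);
        header('Refresh: 5; url=' . with_marker($_SERVER['REQUEST_URI']));
        echo 'Hi! I am checking if you are human...';
        exit;
    }
    if ($decision === 'blocked') {
        echo 'Cookies seem to be disabled, so this page cannot be shown.';
        exit;
    }
    // 'content': continue with the normal page.
}
```

A client that blocks cookies gets exactly one refresh and then the "blocked" message, so the loop ends after a single retry.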

Brett_Tabke

11:44 am on Oct 1, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This is how I solved it here in Perl. You can adapt this easily to PHP (just have an LLM convert it).


# --- JavaScript requirement gate (anon users only) ---
# This script implements a JavaScript "challenge" to verify that a user's client
# can execute JavaScript. It is designed to filter out
# simple bots and scrapers that do not process JavaScript, while allowing
# legitimate users and sophisticated bots (like search engine crawlers, which
# are handled separately) to pass through.

# --- Tunables ---
# Global configuration variables for the script's behavior.
# The name of the cookie that will be set by the JavaScript challenge.
$JS_COOKIE_NAME = 'jsok';

# The lifespan of the cookie in seconds. Default is 1 year.
$JS_COOKIE_MAX_AGE = 60*60*24*365;

# The domain for which the cookie is valid.
# Set to '.webmasterworld.com' to share the cookie across all subdomains.
# An empty string will default to the current host.
$JS_COOKIE_DOMAIN = 'www.webmasterworld.com';

#--------------------------------------------
# sub js_token_today
# Generates a date-based token string for the current day in UTC.
# The format is YYYYMMDD. This token is used as the expected value
# for the JavaScript verification cookie.
#
# Returns: A string representing today's date (e.g., "20251001").
#--------------------------------------------
sub js_token_today {
my ($sec,$min,$hour,$mday,$mon,$year) = gmtime(time); # UTC, not local time
return sprintf("%04d%02d%02d", $year+1900, $mon+1, $mday);
}

#--------------------------------------------
# sub js_token_yesterday
#
# Generates a date-based token string for the previous day in UTC.
# The format is YYYYMMDD. This is used to create a grace window,
# allowing cookies set on the previous day to still be considered valid.
# This handles users crossing the midnight UTC boundary during a session.
#
# Returns: A string representing yesterday's date (e.g., "20250930").
#--------------------------------------------
sub js_token_yesterday {
my ($sec,$min,$hour,$mday,$mon,$year) = gmtime(time - 86400);
return sprintf("%04d%02d%02d", $year+1900, $mon+1, $mday);
}

#--------------------------------------------
# sub cookie_value
#
# A utility function to parse the HTTP_COOKIE environment variable
# and extract the value of a specific cookie by its name.
# Args: $name: The name of the cookie to retrieve.
#
# Returns: The value of the cookie if found, otherwise 'undef'.
#--------------------------------------------
sub cookie_value {
my ($name) = @_;
my $c = $ENV{HTTP_COOKIE} // '';
return ($c =~ /(?:^|;\s*)\Q$name\E=([^;]+)/) ? $1 : undef;
}

#--------------------------------------------
# sub wants_html_request
#
# determines if the current request is for an HTML document.
# This is used to avoid showing the JavaScript challenge for non-HTML
# resources like images, CSS files, or API calls.
#
# It checks the request method (GET/HEAD) and the HTTP Accept header.
#
# Returns: 1 (true) if the request appears to be for HTML, 0 (false) otherwise.
#--------------------------------------------
sub wants_html_request {
my $m = $ENV{REQUEST_METHOD} // '';
return 0 unless $m eq 'GET' || $m eq 'HEAD';
my $accept = lc($ENV{HTTP_ACCEPT} // '');
return 1 if $accept eq ''; # some simple clients omit Accept
return 1 if $accept =~ m{text/html} || $accept =~ m{application/xhtml\+xml};
return 0 if $accept =~ m{application/json|image/|text/css|javascript};
return 1 if $accept =~ m{\*/\*}; # Broad accept header, assume HTML
return 0;
}

#--------------------------------------------
# sub jsok_cookie_present
#
# Checks for the presence of a valid JavaScript verification cookie.
# A cookie is considered valid if its value matches the token for
# today or yesterday (the grace window).
# It also bypasses the check if a logged-in user session is detected
# (indicated by the global $Logged_In_User variable).
#
# Returns: 1 (true) if the user is verified, 0 (false) otherwise.
#--------------------------------------------
sub jsok_cookie_present {
my $v = &cookie_value($JS_COOKIE_NAME) // '';
return 1 if $v eq &js_token_today; # today
return 1 if $v eq &js_token_yesterday; # grace window
return 1 if length($Logged_In_User); # logged-in users bypass the gate; assumes $Logged_In_User holds WebmasterWorld's logged-in user info
return 0;
}

#--------------------------------------------
# sub emit_js_challenge_and_exit
#
# This function is the core of the gate. It stops the normal request
# and serves a minimal HTML page with a JavaScript snippet. The script
# sets the verification cookie and then immediately redirects the user
# back to the originally requested URI. For clients without JavaScript,
# a message is displayed.
#
# The script terminates immediately after sending this response.
#
# No return value, as it calls 'exit'.
#--------------------------------------------
sub emit_js_challenge_and_exit {
&log_nojs_event;

my $uri = $ENV{"REQUEST_URI"} || '/';
# Hex-escape anything outside a conservative URL character set so the URI
# cannot break out of the JavaScript string literal or the <script> block.
(my $js_uri = $uri) =~ s/([^A-Za-z0-9\/?&=._~%:+,;-])/sprintf("\\x%02x", ord($1))/ge;
my $maxage = $JS_COOKIE_MAX_AGE;
my $domain = $JS_COOKIE_DOMAIN ne '' ? "; Domain=$JS_COOKIE_DOMAIN"
: ($ENV{"HTTP_HOST"} // '') ne '' ? "; Domain=$ENV{HTTP_HOST}" : '';
my $today = &js_token_today;

print "Status: 200 OK\r\n";
print "Content-Type: text/html; charset=utf-8\r\n";
print "Cache-Control: no-store\r\n";
print "X-Robots-Tag: noindex, nofollow\r\n";
print "\r\n";

print qq|<html><title>JS Enabled?</title><script>(function(){
document.cookie="$JS_COOKIE_NAME=$today; Max-Age=$maxage; Path=/; SameSite=Lax; Secure$domain";
location.replace("$js_uri");
})();</script>
<noscript><b>JavaScript required</b><p>Please enable JavaScript and <a href="/">try again</a>.</p></noscript></html>|;
exit; #umm bye bye
}

# --- Main Gate Logic ---
# This is the main block that decides whether to trigger the Js challenge.
#
# 1. Skip if $sedomain is true (assumed to be a flag for known search engines/bots).
# 2. Check if the client is requesting an HTML page.
# 3. If so, check if the client has already passed the JS check (valid cookie or logged in).
# 4. If all conditions are met (HTML request without a valid cookie),
# then execute the challenge.
#
if (!$sedomain) {
if (&wants_html_request) {
if (!&jsok_cookie_present) {
&emit_js_challenge_and_exit;
}
}
}

#--------------------------------------------
# sub promote_jsok_cookie_header
#
# After a user has passed the JS gate, this function can be
# called on subsequent page loads to generate a "Set-Cookie" header that
# re-issues the cookie with the 'HttpOnly' flag. This is a security
# enhancement that prevents client-side scripts from accessing the cookie,
# mitigating certain XSS attacks.
# Returns: A full "Set-Cookie" header string.
#--------------------------------------------
sub promote_jsok_cookie_header {
my $domain = $JS_COOKIE_DOMAIN ne '' ? "; Domain=$JS_COOKIE_DOMAIN"
: ($ENV{"HTTP_HOST"} // '') ne '' ? "; Domain=$ENV{HTTP_HOST}" : '';
my $maxage = $JS_COOKIE_MAX_AGE;
my $today = &js_token_today;
return "Set-Cookie: $JS_COOKIE_NAME=$today; Max-Age=$maxage; Path=/; SameSite=Lax; Secure; HttpOnly$domain\r\n";
}

#--------------------------------------------
# sub log_nojs_event
#
# Logs an entry for each time the JavaScript challenge is triggered.
# This helps in monitoring how many non-JS clients are being gated.
# It writes the client's IP, host, User-Agent, and a timestamp to a
# daily log file.
# No returns
#--------------------------------------------
sub log_nojs_event {
# Build filename: nojs-log-MM-DD-YYYY.dat
my ($sec,$min,$hour,$mday,$mon,$year) = localtime(time);
$mon += 1;
$year += 1900;
my $file = <redacted to prevent file access attempts>;

my $line = "$ENV{REMOTE_ADDR}|$ENV{REMOTE_HOST}|$ENV{HTTP_USER_AGENT}|" . time . "|";

open(FILEJS, '>>', $file) or return; # skip logging if the file cannot be opened
flock(FILEJS,2) if $operatingsystem eq "unix"; # Use an advisory lock for safe concurrent writes
chmod(0755, $file) if $operatingsystem eq "unix";
print FILEJS "$line\n";
close FILEJS;
}


#test for se bot
sub CkSearchEngineCrawler {
# spider check
$spiderip= 0;
$sedomain= 0;
$SpiderAgent= 0;

#SEIPS is a list of searchengine ips
foreach $t (split(/ /,$SEIPS)) {
$t =~ s/ //gi;
next if !length($t);
# No /g here: in a boolean test it only carries pos() over to the next match.
$spiderip++ if $ENV{"REMOTE_ADDR"} =~ /$t/i;
}


#SEdomains is a list of sedomains "bing.com" etc
foreach $t (split(/ /,$SEDomains)) {
$t =~ s/ //gi;
next if !length($t);
$sedomain++ if $ENV{"REMOTE_HOST"} =~ /$t/i;
}

#Finally agents:
$SEBotAgents =~ s/\&quot\;/\"/gi;
foreach $q (split(/\"/,$SEBotAgents)) {
$q =~ s/^\s+//;
$q =~ s/\s+$//;
next if !length($q);
$SpiderAgent++ if $ENV{'HTTP_USER_AGENT'} =~ /$q/i;
}

$sespider++ if $sedomain || $spiderip;
}
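
For the PHP conversion, the date-token check (today's UTC date, with yesterday's accepted as a grace window) might look roughly like this. It is a sketch of that one piece only, not a full port: the function names loosely mirror the Perl subs, and the $now parameter is an addition so the date logic can be checked with a fixed clock.

```php
<?php
// Sketch only: covers the token check, not the challenge page or the crawler bypass.

/** YYYYMMDD token for the UTC day containing $now (a Unix timestamp). */
function js_token(int $now): string
{
    return gmdate('Ymd', $now);
}

/** True if the cookie value matches today's or yesterday's UTC token. */
function jsok_cookie_valid(?string $value, int $now): bool
{
    if ($value === null) {
        return false;                           // no cookie sent at all
    }
    return $value === js_token($now)            // today
        || $value === js_token($now - 86400);   // grace window
}
```

In a page this would be called as jsok_cookie_valid($_COOKIE['jsok'] ?? null, time()).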

londrum

2:44 pm on Oct 1, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You could just redirect the crawlers to a dead-end page with nothing on it. But plenty of real users block cookies as well, so I'd imagine you're going to lose a lot of them too.

Sutin

3:24 pm on Oct 1, 2025 (gmt 0)

10+ Year Member



I solved this to my (current) satisfaction. If there is no existing cookie, then I have .htaccess set the cookie, add a random parameter to the query string, and return a 303 status code. If the crawler is sufficiently stupid, I never see it again. A person or a smarter crawler will automatically reload due to the 303 code.

If a human has session cookies disabled, then their browser should deal intelligently with an unending series of 303s; browsers generally give up after a fixed number of redirects and show an error instead of looping forever.
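
In PHP rather than .htaccess, the same idea might look roughly like this. It is a sketch under assumptions, not the actual rewrite rules: the cookie name 'seen' and the parameter name 'r' are made up, and PHP 7.3 or later is assumed for the setcookie() options array.

```php
<?php
// Sketch only. The 'seen' cookie and the 'r' parameter are made-up names.

/** Return $uri with a fresh random 'r' parameter, replacing any existing one. */
function redirect_target(string $uri): string
{
    $path  = parse_url($uri, PHP_URL_PATH);
    $query = parse_url($uri, PHP_URL_QUERY);
    $params = [];
    // Note: parse_str() rewrites dots and spaces in parameter names to underscores.
    parse_str(is_string($query) ? $query : '', $params);
    $params['r'] = bin2hex(random_bytes(4));   // 8 hex characters
    return (is_string($path) ? $path : '/') . '?' . http_build_query($params);
}

// Skipped on the command line so the function above can be exercised in isolation.
if (PHP_SAPI !== 'cli' && !isset($_COOKIE['seen'])) {
    setcookie('seen', '1', ['path' => '/']);
    header('Location: ' . redirect_target($_SERVER['REQUEST_URI']), true, 303);
    exit;
}
```

Replacing an existing 'r' rather than appending keeps the URL from growing on every redirect when the cookie never sticks.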

Brett_Tabke

3:48 pm on Oct 1, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Not sure if I have ever seen (why would I?) a 303 in the wild. That looks like an interesting usage of it. Might try that.