Forum Moderators: open
// Add more spiders to this list as you find them
$spiders = array("Googlebot", "WebCrawler", "Other engines etc etc");
$spider_count = 0;
foreach ($spiders as $Val) {
    if (eregi($Val, getenv("HTTP_USER_AGENT"))) {
        $spider_count++;
    }
}
if ($spider_count != 0) {
    // Edit out one of these as necessary, depending upon your version of html_output.php
    $sess = NULL;
    // $sid = NULL;
}
Now, is this safe for Google? Will Googlebot think the site is trying to cloak? I know Google sends out requests from different IPs to hunt for cloaked sites, and the last thing I want to do is get a site banned for cloaking... *shakes*
"Allow search bots to crawl your sites without session ID's or arguments that track their path through the site."
I'm playing around making a cart, where the PHP session ID is needed for authentication.
It's done so that only logged-in visitors see the string in the URL, which allows tracking/authentication of them; otherwise it's a plain-jane page served to anyone.
I'm not exactly savvy with PHP, but unless the session ID is part of the URL it shouldn't matter, even if you kill the session and change to a new one?
You aren't giving the search engine different content than what a human sees. You are just serving the content from a different URL. That in no way causes harm to their index. All it does is help them out by making sure they don't waste bandwidth on your site.
The problem I would check, though, is: does the bot receive exactly the same amount of data, in bytes, whether or not a session ID is handed over?
I am not techie enough to answer this question, but I would assume that at least the total amount of data transferred between server and client differs in the two cases.
The question is: how would a de-cloaking spider find out whether a server delivers the same source code no matter what user agent or IP asks for that page?
I've managed to mess up my PHP config this week messing around with these PHP session IDs and how they work, so this might not work ;)
If you have access to php.ini, find this line and make sure it is set to 0:
session.use_trans_sid = 0
AFAIK, when that setting is 1, PHP puts the session ID into URLs automatically. If you change it to 0, session IDs are not put into URLs, so I'd think Google wouldn't care what session it was, because the URL would be the same.
I've not much of an idea how the whole shebang goes, but tinkering with the php.ini session settings might make the problem a bit easier.
Worth a tinker or a look maybe, though you'd probably have to start altering the commerce script as well.
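For anyone without access to php.ini, a minimal sketch of the same idea done at request time, assuming a PHP version where these two settings are changeable with ini_set() (the calls must come before session_start()); this is illustrative, not any poster's actual setup:

```php
<?php
// Hedged sketch: override the php.ini session settings at runtime,
// before any session is started.
ini_set('session.use_trans_sid', 0);    // don't rewrite URLs to carry the SID
ini_set('session.use_only_cookies', 1); // ignore any SID arriving in the URL
```

With both set this way, PHP falls back to cookie-only sessions, so spidered URLs never pick up a session ID in the first place.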
I did nearly the same thing on a site some months ago, and immediately got the dreaded PR0 for it. (I really changed nothing else.)
I agree it isn't cloaking, but Googlebot will have some difficulty seeing the difference.
Now I am simply using cookies for session tracking. If a user does not accept them (as Googlebot doesn't), the fallback method of appending the session ID as a GET parameter is used only when somebody actually puts something in his cart, which will never happen in Googlebot's case. With this method, there is no risk of being punished for cloaking.
@hyperion: Could it be that your ban resulted from something else? As I have written, I have multiple sites using this without problems with Google (and one site's PR went up to 6 this month).
This will simply turn off session-support for non-cookie-users completely.
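A rough sketch of that cookie-first approach; the cart_link() helper is illustrative, not anyone's actual code. The idea: sessions are cookie-only, and the SID is appended to a URL by hand only when the browser has not returned the session cookie, which is only worth doing on cart links:

```php
<?php
// Hedged sketch: cookie-only sessions by default; the SID is appended
// manually, and only to links a shopper actually needs (e.g. the cart).
ini_set('session.use_trans_sid', 0); // never auto-append the SID anywhere
session_start();                     // issues a session cookie if accepted

// Illustrative helper: carry the SID in the URL only when no session
// cookie came back from the browser (cookie-less humans mid-purchase).
function cart_link($url) {
    if (empty($_COOKIE[session_name()])) {
        $sep = (strpos($url, '?') === false) ? '?' : '&';
        return $url . $sep . session_name() . '=' . session_id();
    }
    return $url;
}
```

Googlebot never adds anything to a cart, so it never sees a SID-bearing link.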
I'm more at a loss each time with this. Here's a quote from the same PHP page you quoted; I was reading it earlier:
URL based session management has additional security risks compared to cookie based session management. Users may send an URL that contains an active session ID to their friends by email or users may save an URL that contains a session ID to their bookmarks and access your site with the same session ID always, for example.
So not all browsers accept cookies, and having the session ID in the URL poses a problem. It makes me wonder how this can be done at all... Google aside :)
No, it was the only change I made in six months, and there are no links to other sites, so a bad neighbourhood cannot have been the cause either.
And when I dumped the change, I got back into Google with the next update. Maybe it works if you use the PHP session-handling functions, so that Googlebot knows only the session ID is missing; but because I use my own set of session functions, I had a GET parameter with a different name.
But I wouldn't try again ;-)...
I am successfully using that exact script (as I am the "clever clogs" ;¬) that wrote it) across a number of Oscommerce sites that I admin.
Initial results:
270 products in the database of the main site I am tracking. Before adding the script *no* product pages were listed.
Since adding in the script, in the update over the past few days, there are now 257 products listed, all without SID...
If anyone can definitely tell me that this is a harmful script, then I'll be glad to listen and make amends, but as of now the proof is here: no products before the script was introduced, 257 of 270 listed after the script was added...
Would definitely appreciate any comments! Thanks.
I did nearly the same thing on a site some months ago, and immediately got the dreaded PR0 for it. (I really changed nothing else.)
I agree it isn't cloaking, but Googlebot will have some difficulty seeing the difference.
Googlebot doesn't "see" anything. It just retrieves links and stores data. If you were really only giving googlebot urls without a session id, there isn't any kind of automated way that googlebot can give you a penalty.
Losing PR for a crawl cycle or two is fairly common. The fact that it happens doesn't mean you've been penalized.
No search engine has any legitimate reason to demand that sites that wish to be indexed must give up the right to track humans who use browsers with cookies disabled.
Any site that is serious about session tracking should set up a system that only excludes spiders. Otherwise, you are giving up a significant amount of data.
Humans with cookies enabled get a cookie.
Humans with cookies disabled get a session id added to the url.
Spiders get neither.
IP/UA detection is the proper way to make that system work. And using such a system is no different from the geo-targeting systems used by all the search engines.
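The three tiers above can be sketched roughly like this; the is_spider() helper and its spider list are illustrative assumptions, and a simple user-agent substring match stands in here for real IP/UA detection:

```php
<?php
// Hedged sketch of the exclude-only scheme: cookies for humans who accept
// them, trans_sid URLs for humans who don't, no session at all for spiders.
function is_spider($user_agent) {
    $spiders = array('Googlebot', 'Slurp', 'msnbot'); // illustrative list only
    foreach ($spiders as $spider) {
        if (stripos($user_agent, $spider) !== false) {
            return true;
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_spider($ua)) {
    // Spiders: no session, so crawled URLs stay SID-free.
    ini_set('session.use_trans_sid', 0);
} else {
    // Humans: cookie if accepted; otherwise PHP rewrites URLs with the SID.
    ini_set('session.use_trans_sid', 1);
    session_start();
}
```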
From this thread I have come to the conclusion that using a script similar to the one provided by burt_online will be sufficient to enable spiders to index the product catalog.
Is this script universal, or will it only work for osCommerce? The problem with osCommerce is that it doesn't offer any synchronization options for ERP software.
This is just my personal take, but allowing Googlebot to crawl without requiring session IDs should not run afoul of Google's policy against cloaking. I encourage webmasters to drop session IDs when they can. I would consider it safe. Fair enough?
Hope that helps,
GoogleGuy
I'm 99.9% sure that a script that only removes the SID for some visitors (the bots) could not be construed as "something terrible"... however, that 0.1% made me think twice over the past day or two...
Onza: the main part of the script is suitable for any site that needs to determine whether the visitor is a spider based upon the user agent; all you'll need to do is change the end bit to suit your circumstances...
This is a great forum! Thanks all.
I just did my first osC site. I forced the use of cookies and completely did away with the session ID. We've got about 2000 pages showing with the "site:" command after 2 months. Almost all have been cached.
Just need to add a little more PR to get the last few crawled.
Had some encouraging early results as far as referrals & rankings go.
Good luck,
rmjvol
It's my first post in this forum and I'd like to start with my 2 cents to improve the script ;)
1. Use $_SERVER["HTTP_USER_AGENT"] // works even if register_globals is turned off
2. A break lets you exit a long spider list early; put the most important entries at the beginning
-------------------------
$spiders = array("Googlebot", "WebCrawler", "etc etc");
$from_spider = FALSE;
foreach ($spiders as $Val) {
    if (eregi($Val, $_SERVER["HTTP_USER_AGENT"])) {
        $from_spider = TRUE;
        break;
    }
}
// Start a session only for non-spiders
if (!$from_spider) {
    session_start();
}
-----------------------
Thanks to all of you for all the valuable info!
<?php
/*
$Id: IsSearchEngine.inc.php,v 1.1.1.1 2000/06/07 22:41:53 af Exp $
$Name: $
*/
#
$br_array = array(%%NO_SESSION_UAS%%);
foreach ($br_array as $browser) {
    if (preg_match("/$browser/i", $HTTP_USER_AGENT)) {
        return true;
    }
}
return false;
?>
The %%NO_SESSION_UAS%% field is filled in by my CMS.
It is called from header.php like this:
$sm = include('shared/IsSearchEngine.inc.php');

Andreas