I've been checking the site stats and found an issue.
Using Googlebot as an example (Gigablast)
The stats show the following:
[domain.com...] (this has no query string and is the issue)
[domain.com...] (this is the working URL)
Google is not allowed to crawl this, per robots.txt (part of the robots.txt file follows):
User-agent: Googlebot
Disallow: /site/detail.php

User-agent: Googlebot-Image
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:
The robots.txt file validates OK.
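One way to sanity-check what these rules permit is Python's standard-library robots.txt parser. This is only a sketch: the URLs are placeholders modelled on the ones in this post, only two of the records are included, and Python's parser is not guaranteed to agree with Googlebot's own matching in every detail.

```python
from urllib.robotparser import RobotFileParser

# Two of the records from the robots.txt above, separated by a blank line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /site/detail.php

User-agent: Slurp
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs (assumed, based on the paths mentioned in this thread).
detail_allowed = parser.can_fetch("Googlebot", "http://www.domain.com/site/detail.php")
htm_allowed = parser.can_fetch("Googlebot", "http://www.domain.com/site/666.htm")
slurp_allowed = parser.can_fetch("Slurp", "http://www.domain.com/site/detail.php")

print(detail_allowed, htm_allowed, slurp_allowed)
```

The point to notice is that the Disallow line only covers URLs that begin with /site/detail.php; the .htm URLs are not covered by it.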
We submit a Google Sitemap once per week (no errors reported in the Google interface).
To explain the sitemap:
This list (the sitemap) contains all adverts in their rewritten form.
[domain.com...] is rewritten to
[domain.com...]
The mod_rewrite rule for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ [domain.com...]
Other relevant rules are:
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule ^(.*)$ [domain.com...] [R=301,L]
This redirects requests for domain.com to www.domain.com, so both names end up at www.domain.com.
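As a rough model of what those three lines do, here is a small Python sketch. The function name is made up for illustration, and the redirect target is an assumption, since the real substitution is elided in the post.

```python
import re
from typing import Optional

CANONICAL_HOST = "www.domain.com"  # host name taken from the rule above


def canonical_redirect(host: str, path: str) -> Optional[str]:
    """Return the assumed 301 target, or None when no redirect is needed."""
    # RewriteCond %{HTTP_HOST} .   -> the Host header must be non-empty
    if not host:
        return None
    # RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
    if re.match(r"www\.domain\.com", host, re.IGNORECASE):
        return None
    # Assumed target: same path on the canonical host.
    return f"http://{CANONICAL_HOST}/{path.lstrip('/')}"


print(canonical_redirect("domain.com", "/site/666.htm"))
```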
and
RewriteCond %{HTTP_USER_AGENT} "Google" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Slurp" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MSNBOT" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "teoma" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ia_archiver" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Scooter" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Mercator" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "FAST" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MantraAgent" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Lycos" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ZyBorg" [NC]
RewriteCond %{QUERY_STRING} SESSIONID
RewriteRule ^(.*)$ $1? [L,R=301]
and
php_value session.use_only_cookies 1
php_value session.use_trans_sid 0
The code above was meant to strip the PHP session ID from the URL and make it invisible.
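To check my own understanding of the RewriteCond chain, here is a hedged Python model of it. The function name and the sample request are hypothetical; the marker list and the SESSIONID test come from the rules above.

```python
from typing import Optional

# Substrings from the RewriteCond lines above; [NC] makes them case-insensitive.
BOT_MARKERS = ["Google", "Slurp", "MSNBOT", "teoma", "ia_archiver", "Scooter",
               "Mercator", "FAST", "MantraAgent", "Lycos", "ZyBorg"]


def strip_session_redirect(user_agent: str, path: str, query: str) -> Optional[str]:
    """Return the redirect target for a bot carrying SESSIONID, else None."""
    ua = user_agent.lower()
    is_bot = any(marker.lower() in ua for marker in BOT_MARKERS)
    # RewriteCond %{QUERY_STRING} SESSIONID has no [NC], so it is case-sensitive.
    if is_bot and "SESSIONID" in query:
        # The substitution "$1?" keeps the path and drops the entire query string.
        return path
    return None


# Hypothetical bot request, for illustration only.
print(strip_session_redirect("Googlebot/2.1", "/site/detail.php", "siteid=666&SESSIONID=abc"))
```

One thing the sketch makes visible: `$1?` drops the whole query string, not just the session parameter, so a hypothetical bot request for detail.php?siteid=666&SESSIONID=abc would be redirected to detail.php with no query at all.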
The issue, unless I'm wrong, is that Google is attempting to crawl detail.php and is ignoring robots.txt.
If you click [domain.com...] you get a MySQL error, since it has no query string.
The offending page, detail.php, has this code at the top for search engines:
<?php
// Use this to start a session only if the UA is *not* a search engine
$searchengines = array("Google", "Fast", "Slurp", "Ink", "Atomz", "Scooter", "Crawler", "MSNbot", "Poodle", "Genius");
$is_search_engine = 0;
foreach ($searchengines as $key => $val) {
    //if(strstr("$HTTP_USER_AGENT", $val)) {
    if (strstr($_SERVER['HTTP_USER_AGENT'], $val)) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) {
    // visitor is not a search engine - start the session
    ini_set("session.save_handler", "files");
    session_start();
    // You can put anything else in here that needs to be hidden from search engines
} else {
    // visitor is a search engine - put anything you want only a search engine to see in here
}
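A side note on the check itself: PHP's strstr() is case-sensitive, so a list entry only matches if the user agent uses the same capitalisation. The Python sketch below mirrors the loop to show the effect; the function names are made up and the user-agent strings are illustrative samples, not lines taken from our logs.

```python
SEARCH_ENGINES = ["Google", "Fast", "Slurp", "Ink", "Atomz", "Scooter",
                  "Crawler", "MSNbot", "Poodle", "Genius"]


def is_search_engine(user_agent: str) -> bool:
    # Python's `in` is case-sensitive, like PHP's strstr().
    return any(name in user_agent for name in SEARCH_ENGINES)


def is_search_engine_nocase(user_agent: str) -> bool:
    # Case-insensitive variant, comparable to using stristr() in PHP.
    ua = user_agent.lower()
    return any(name.lower() in ua for name in SEARCH_ENGINES)


# Illustrative sample user agents (assumed, not from the site's logs).
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
msnbot = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

print(is_search_engine(googlebot), is_search_engine(msnbot), is_search_engine_nocase(msnbot))
```

With a lowercase "msnbot" agent string, the case-sensitive version would treat the crawler as an ordinary visitor and start a session for it.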
So what is meant to happen is that Google and the other search engines crawl A; B is not allowed (robots.txt), and hence they never see C.
A) www.domain.com/site/666.htm, which is rewritten from B
B) detail.php?=site666, which is the normal URL
C) detail.php with no query string, hence a MySQL error
So in summary:
We tell Google not to crawl /site/detail.php at all.
We tell Google, via the weekly sitemap submission, to read googlesitemap.txt, which is the rewritten advert list.
This [domain.com...] is rewritten to [domain.com...]
So why is Google going to [domain.com...] and causing this issue?
I am at my wits' end and have no more ideas.
PS: Google crawled over 600 pages 2 days ago and comes back like this every week.
Thanks in advance.
I can't make sense of this question, because there are some internal inconsistencies among the (many) details. But maybe the following comments will help you:
This http://www.domain.com/site/detail.php?=site666 is rewritten to http://www.domain.com/site/666.htm
No, your code clearly rewrites the URL http://www.domain.com/site/666.htm to the local URL-path /site/detail.php?=site666
Mod rewrite for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ http://www.domain.com/site/detail.php?siteid=$1
So Google requests /site/666.htm (which is not disallowed), the request gets rewritten to /site/detail.php?=site666, and so that page gets crawled.
You'll need to disallow google from all .htm files if you don't want detail.php to execute.
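To make the order of events concrete, here is a rough Python sketch of the mapping. It only models the pattern and target from the RewriteRule quoted above; the function name is invented for illustration and this is not how Apache is implemented.

```python
import re

# Pattern from the RewriteRule quoted above (note: no leading ^ anchor).
RULE = re.compile(r"([a-zA-Z0-9]*)\.htm$")


def internal_rewrite(requested_path: str) -> str:
    """Return the local URL-path the server ends up executing."""
    match = RULE.search(requested_path)
    if match is None:
        return requested_path  # rule does not apply, path is served as-is
    return f"/site/detail.php?siteid={match.group(1)}"


print(internal_rewrite("/site/666.htm"))
```

The crawler only ever sees the URL it requested (/site/666.htm), and robots.txt is checked against that requested URL, not against the rewritten one.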
Jim
This [domain.com...] is rewritten to [domain.com...]
No, your code clearly rewrites the URL [domain.com...] to the local URL-path /site/detail.php?=site666
Yes, but no, it's the other way round. Google sees 666.htm, not detail.php?=site666, and this does work.
Mod rewrite for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ [domain.com...]
The original code I copied was:
RewriteEngine on
#RewriteRule ^~(.*)$ /profile.php?profile=$1 [L]
RewriteRule ^([a-zA-Z0-9]*).html detail.php?siteid=$1
Our site doesn't like the ^ for some reason (any ideas?)
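One possible explanation, assuming the .htaccess file sits in the document root rather than inside /site/: in per-directory context mod_rewrite tests the pattern against the path relative to that directory, so the string being matched would be site/666.htm, not 666.htm. The sketch below uses Python's re module purely to illustrate the anchoring; the file location is a guess.

```python
import re

ANCHORED = re.compile(r"^([a-zA-Z0-9]*)\.htm$")
UNANCHORED = re.compile(r"([a-zA-Z0-9]*)\.htm$")

root_relative = "site/666.htm"  # assumed: .htaccess in the document root
dir_relative = "666.htm"        # assumed: .htaccess inside /site/

# With ^, the whole string must be letters/digits, and "/" is not in the class.
anchored_root = ANCHORED.search(root_relative)
# Without ^, the match can start after the "site/" prefix.
unanchored_root = UNANCHORED.search(root_relative)
# With the file inside /site/, the anchored pattern sees only "666.htm".
anchored_dir = ANCHORED.search(dir_relative)

print(anchored_root, unanchored_root.group(1), anchored_dir.group(1))
```

If that guess is right, either moving the rule into /site/.htaccess or writing the pattern as ^site/([a-zA-Z0-9]*)\.htm$ should let the ^ stay.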
Correct
So Google requests /site/666.htm (which is not disallowed), the request gets rewritten to /site/detail.php?=site666, and so that page gets crawled.
No, it's the other way around: Google is allowed 666.htm, not detail.php?=site666.
You'll need to disallow google from all .htm files if you don't want detail.php to execute.
No, you misunderstand.
OK, I have updated the robots.txt file to this:
User-agent: Googlebot
Disallow: /site/detail.php
User-agent: *
blah
blah for all other SEs
So now I need to understand our Google Sitemaps.
They can be a list in a .txt file or an XML list.
Which is better to submit to Google Sitemaps: a .txt list or an XML list?
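For comparison, this is roughly what the two formats look like when generated from the same URL list. It is a sketch only: the second URL is invented, the function names are mine, and the XML namespace shown is the sitemaps.org 0.9 one, which is an assumption about which schema version applies here.

```python
import xml.etree.ElementTree as ET

# sitemaps.org protocol namespace (assumed schema version).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def text_sitemap(urls):
    """Plain-text sitemap: one URL per line, nothing else."""
    return "\n".join(urls) + "\n"


def xml_sitemap(urls):
    """XML sitemap: a <urlset> with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Example URLs in the rewritten form; 667 is hypothetical.
urls = ["http://www.domain.com/site/666.htm", "http://www.domain.com/site/667.htm"]
txt_out = text_sitemap(urls)
xml_out = xml_sitemap(urls)
print(txt_out)
print(xml_out)
```

Both carry the same list of URLs; the XML form additionally leaves room for optional per-URL elements such as lastmod.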
Last question: we have gone from 44800 to 512 results for site:www.domain.com
Why is this? Will it start to go back up again?