Google Issues Crawling

         

swapshop

2:40 am on Mar 22, 2006 (gmt 0)

10+ Year Member



Hopefully someone can assist us, as I have run out of ideas.

I have been checking the site stats and found an issue.

Using Googlebot as an example (Gigablast),
the stats show the following:

[domain.com...] (this has no query string and is the issue)

[domain.com...] (this is the working URL)

Google is not allowed to crawl detail.php, per robots.txt (part of the robots.txt file):

User-agent: Googlebot
Disallow: /site/detail.php

User-agent: Googlebot-Image
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

The robots.txt file validates OK.

We submit a Google Sitemap once per week (no errors reported in the Google interface).

To explain the site map

This list (sitemap) is all adverts in a rewritten form

[domain.com...] is rewritten to

[domain.com...]

Mod rewrite for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ [domain.com...]

Other points here are

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule ^(.*)$ [domain.com...] [R=301,L]

This sends requests for either domain.com or www.domain.com to www.domain.com

and

RewriteCond %{HTTP_USER_AGENT} "Google" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Slurp" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MSNBOT" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "teoma" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ia_archiver" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Scooter" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Mercator" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "FAST" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MantraAgent" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Lycos" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ZyBorg" [NC]
RewriteCond %{QUERY_STRING} SESSIONID
RewriteRule ^(.*)$ $1? [L,R=301]

and

php_value session.use_only_cookies 1
php_value session.use_trans_sid 0

The code above is meant to strip the PHP session ID from the URL so that it is invisible.

The issue, unless I'm wrong, is that Google is attempting to crawl detail.php; it ignores robots.txt.

If you click [domain.com...] you get a MySQL error, since it has no query string.

The offending page, detail.php, has this code at the top for search engines:

<?php
// Start a session only if the UA is *not* a search engine
$searchengines = array("Google", "Fast", "Slurp", "Ink", "Atomz", "Scooter", "Crawler", "MSNbot", "Poodle", "Genius");
// Some requests send no User-Agent header at all, so fall back to an empty string
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_search_engine = 0;
foreach ($searchengines as $val) {
    // stripos() is case-insensitive, so "MSNbot" also matches a lowercase "msnbot" UA
    if (stripos($user_agent, $val) !== false) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) {
    // visitor is not a search engine - start the session
    ini_set("session.save_handler", "files");
    session_start();
    // You can put anything else in here that needs to be hidden from search engines
} else {
    // visitor is a search engine - put anything you want only a search engine to see in here
}

So what is meant to happen is that Google and other search engines crawl A; B is not allowed (robots.txt), and hence they never see C.

A) www.domain.com/site/666.htm, which is rewritten from B
B) detail.php?=site666, which is the normal URL
C) detail.php with no query string, hence a MySQL error

So in summary:

We tell Google not to crawl /site/detail.php at all.
We tell Google, via the submitted sitemap, to read googlesitemap.txt, which is the rewritten advert list.

This [domain.com...] is rewritten to [domain.com...]

So why is Google going to [domain.com...] and causing this issue?

I am at my wits' end and have no more ideas.

PS Google crawled over 600 pages 2 days ago and comes back like this every week.

Thanks in advance

jdMorgan

6:34 am on Mar 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



swapshop,

I can't make sense of this question, because there are some internal inconsistencies among the (many) details. But maybe the following comments will help you:

> This http://www.domain.com/site/detail.php?=site666 is rewritten to http://www.domain.com/site/666.htm

No, your code clearly rewrites the URL http://www.domain.com/site/666.htm to the local URL-path /site/detail.php?=site666

> Mod rewrite for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ http://www.domain.com/site/detail.php?siteid=$1

So Google requests /site/666.htm (which is not disallowed), the request gets rewritten to /site/detail.php?=site666, and so that page gets crawled.

You'll need to disallow google from all .htm files if you don't want detail.php to execute.
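
To make the point concrete, here is a minimal sketch in Python using the standard `urllib.robotparser` module. It only models how a robots.txt rule is matched against the URL a crawler requests; it says nothing about how Googlebot itself is implemented. The example URLs, including the `siteid=666` query, are illustrative values taken from this thread.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot section of the robots.txt quoted earlier in the thread.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /site/detail.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def googlebot_may_fetch(url: str) -> bool:
    """Return True if the rules above let Googlebot request this URL."""
    return parser.can_fetch("Googlebot", url)


# The public .htm URL is not covered by the Disallow line...
htm_allowed = googlebot_may_fetch("http://www.domain.com/site/666.htm")
# ...while a direct request for the script is.
php_allowed = googlebot_may_fetch("http://www.domain.com/site/detail.php?siteid=666")

print(htm_allowed, php_allowed)  # True False
```

The crawler never sees the internal rewrite, so the Disallow line on detail.php has no effect on requests for the .htm URLs.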

Jim

swapshop

3:27 am on Mar 23, 2006 (gmt 0)

10+ Year Member



Thanks Jim

> This [domain.com...] is rewritten to [domain.com...]

> No, your code clearly rewrites the URL [domain.com...] to the local URL-path /site/detail.php?=site666

Yes, but no, it's the other way round. Google sees 666.htm, not detail.php?=site666, and this does work.

> Mod rewrite for this in .htaccess is RewriteRule ([a-zA-Z0-9]*)\.htm$ [domain.com...]

The original code I copied was:

RewriteEngine on
#RewriteRule ^~(.*)$ /profile.php?profile=$1 [L]
RewriteRule ^([a-zA-Z0-9]*).html detail.php?siteid=$1

Our site doesn't like the ^ for some reason (any ideas?)

Correct

> So Google requests /site/666.htm (which is not disallowed), the request gets rewritten to /site/detail.php?=site666, and so that page gets crawled.

No, it's the other way around: Google is allowed 666.htm, not detail.php?=site666.

> You'll need to disallow google from all .htm files if you don't want detail.php to execute.

No, that is a misunderstanding.

OK, I have updated the robots.txt file to this:

User-agent: Googlebot
Disallow: /site/detail.php

User-agent: *
blah
blah for all other SEs

So now I need to understand our Google Sitemaps.

They can be submitted as a list in a text file or as an XML list.

Which is better to submit to Google Sitemaps: a text list or an XML list?

Last question: we have gone from 44800 to 512 results for site:www.domain.com

Why is this? Will it start to go back up again?

jdMorgan

3:37 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Our site doesn't like the ^ for some reason (any ideas?)

This indicates a fundamental problem with the server, and I would not proceed until this problem is fixed.
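
One thing worth ruling out first, offered as an assumption rather than a diagnosis: if the .htaccess file sits in the document root rather than in /site/, mod_rewrite tests the pattern against `site/666.htm`, and a `^`-anchored pattern whose character class excludes `/` cannot match that. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine, which should behave the same way for patterns this simple; the helper name `captured_id` is made up for the illustration.

```python
import re

# Pattern from the thread (unanchored at the start) and the same pattern
# with a leading ^ added, as in the code it was copied from.
UNANCHORED = re.compile(r"([a-zA-Z0-9]*)\.htm$")
ANCHORED = re.compile(r"^([a-zA-Z0-9]*)\.htm$")


def captured_id(pattern, path):
    """Return what $1 would hold if the pattern matches, else None."""
    match = pattern.search(path)
    return match.group(1) if match else None


# Path as seen from an .htaccess in the document root (assumed location).
print(captured_id(UNANCHORED, "site/666.htm"))  # 666
print(captured_id(ANCHORED, "site/666.htm"))    # None
# Path as seen from an .htaccess inside /site/ itself.
print(captured_id(ANCHORED, "666.htm"))         # 666
```

If that turns out to be the cause, the anchored pattern would need to account for the `site/` prefix, or the rule would need to live in the /site/ directory.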

XML versus text Google site-maps: Use whatever is more convenient for you and your colleagues to create, discuss, and maintain.
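
For comparison, a rough Python sketch of both formats built from the same URL list. The second URL (`667.htm`) is hypothetical, and the XML namespace shown is the sitemaps.org 0.9 one, which is an assumption; check the schema version the Google interface currently asks for.

```python
from xml.sax.saxutils import escape


def text_sitemap(urls):
    """Plain-text sitemap: one absolute URL per line."""
    return "\n".join(urls) + "\n"


def xml_sitemap(urls):
    """Minimal XML sitemap with only the required <loc> element per URL."""
    entries = "".join(
        f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        # Namespace is an assumption; use the one your schema version requires.
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</urlset>\n"
    )


# Rewritten advert URLs; 667.htm is a made-up second entry.
urls = [
    "http://www.domain.com/site/666.htm",
    "http://www.domain.com/site/667.htm",
]

print(text_sitemap(urls))
print(xml_sitemap(urls))
```

The text form carries only the URLs; the XML form leaves room for optional per-URL fields if you later want them.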

Jim