bots are indexing bogus pages, php problem?

         

jake66

2:06 am on Nov 22, 2005 (gmt 0)

10+ Year Member



i use htaccess to rewrite from /product_info/product_id=00 to:
Product+Category/Product_Brand.html

the problem:
yahoo and google are indexing pages THAT DO NOT EXIST and never have.

they're indexing VALID category names, but they are also linking them to other categories. for instance:

Old+Antiques/1920s.html (valid)
the spiders find this no problem.

but for the past month, they have been finding url's like this:
Old+Antiques/1920s__1930s.html
what they are doing is merging one category with another, and this is producing a 200/OK response.

i can't figure out why.
could it be in my htaccess or my code?

if any part of the code needs to be posted, please let me know.

this is from an oscommerce-based website, but i have already hit the osc forums and it has been determined to not be an oscommerce issue.

the htaccess and rewrite script i use is not stock oscommerce.
i have sent a bot to pull every page linked on my site, none of these url's are showing up anywhere.

for reference, here is my htaccess rule:
RewriteEngine on
RewriteBase /
RewriteRule ^([^/]*)\.html$ $1.php?%{QUERY_STRING} [NC]
RewriteRule ^/?(category)/([^/]*)\.html$ index.php?cPath=$2&%{QUERY_STRING} [NC]

jake66

2:07 am on Nov 22, 2005 (gmt 0)

10+ Year Member



i have also posted this in the htaccess forum, but was instructed to post my problem here instead.

jake66

3:32 am on Nov 22, 2005 (gmt 0)

10+ Year Member



it seems to have fixed itself, either that.. or there was a glitch with one of the categories i deleted today.

if anyone has any suspicions as to why this happened, i would love to hear it :)

jake66

10:57 am on Nov 26, 2005 (gmt 0)

10+ Year Member



this is happening again, apparently for no reason.
can anyone offer a suggestion what to look at?

Anyango

12:01 pm on Nov 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



i don't really know what the reason could be and i am shooting in the dark, but i would suggest a simple way to avoid this situation. you said:


i use htaccess to rewrite from /product_info/product_id=00 to:
Product+Category/Product_Brand.html

the problem:
yahoo and google are indexing pages THAT DO NOT EXIST and never have.

if you don't have something to show for this page

Old+Antiques/1920s__1930s.html

why are you sending back 200/OK?

i would send a 404 from php code to the client for any page for which i don't have information to show.

:)

jake66

5:31 pm on Nov 26, 2005 (gmt 0)

10+ Year Member



the thing is, i am not sending a 200/ok back. the site somehow is doing that on its own. i suspect it's to do with either the htaccess rule or the php script that accompanies the rewrite rule.

for example:
the default oscommerce category url's are: /index.php?cPath=1

mine are rewritten to use the category name in the url.

i'm new to htaccess & php, perhaps there is something painfully obvious i am missing? how could i force a 404 response on these nonexistent url's?

jd01

6:54 pm on Nov 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem is Apache sends a '200 OK' header, because your script opened properly --- the page is 'found' --- Apache does not know that there is no information on your page.

You will need to check within your php script for a result, and if there is not any information, set a '404 Not Found' header manually, before anything is output by your php page.

EG
if (!$result = mysql_fetch_array($your_stuff)) {
    // no row came back, so tell the client the page does not exist
    header($_SERVER['SERVER_PROTOCOL'] . " 404 Not Found");
    exit();
}

Hope this helps.

Justin

FalseDawn

8:18 pm on Nov 26, 2005 (gmt 0)

10+ Year Member



I think the problem is in your .htaccess rules - if the URL you enter is not matched by any rules, then a 404 should automatically be generated - you shouldn't have to mess about with using PHP to return that.

I guess that your rules are written in a way that non-existent categories are (wrongly) getting matched by one of the rules. This is usually not a problem, since the links within your site _should_ all point to sensible URLs, so (eg)
category1/bluewidget/ID55

would be the "same" page as
category1/blahblahblah/ID55

But you would never link to the latter, so it's not a problem.
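
To make that concrete, here is a minimal sketch, assuming a URL scheme like the one in the example above where only the trailing ID decides which page is shown. The function name extractId and the pattern are hypothetical, not taken from osCommerce or from the contribution in question.

```php
<?php
// Hypothetical URL scheme from the example above: category/name/ID<number>.
// Only the trailing ID picks the page; the name segment is never checked.
function extractId(string $path): ?int
{
    if (preg_match('#^[^/]+/[^/]+/ID(\d+)$#', $path, $m) === 1) {
        return (int) $m[1];
    }
    return null; // the path does not follow the scheme at all
}

// Both paths resolve to the same ID, so both would serve the same page.
var_dump(extractId('category1/bluewidget/ID55') === extractId('category1/blahblahblah/ID55'));
```

Because the name segment is ignored, any made-up name in front of a valid ID still resolves to a real page.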

OSCommerce can be notoriously finicky with caches and shared session directories which _might_ explain your overlapping category links.

Do you have 1 rule per category, or 1 rule to cover all cases?

jd01

8:43 pm on Nov 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



if the URL you enter is not matched by any rules...

Most .htaccess files use pattern matching for efficiency, and SEs request random URLs to see how you handle 404 errors on your site... (also, if you are my competitor and I notice you load a blank page when no information is found rather than serving a 404 I might be inclined to give you some links --- free! =)

The best answer is to ensure your php page serves a proper error if there are no results returned.
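
A rough sketch of that check, assuming the rewritten category name arrives in $_GET['cPath'] as in the rule from message 1. The $categories array is a hypothetical stand-in for the real database lookup, and findCategoryId and statusFor are made-up names, not osCommerce functions.

```php
<?php
// Hypothetical lookup table standing in for the category table in the database.
$categories = ['1920s' => 12, '1930s' => 13];

// Returns the category ID for a name taken from the URL, or null if unknown.
function findCategoryId(string $name, array $categories): ?int
{
    return $categories[$name] ?? null;
}

// Returns the HTTP status the page should send for the requested name.
function statusFor(string $name, array $categories): int
{
    return findCategoryId($name, $categories) === null ? 404 : 200;
}

// Usage near the top of the page script, before any output is sent:
// if (statusFor($_GET['cPath'] ?? '', $categories) === 404) {
//     http_response_code(404);
//     exit;
// }
```

With this in place a merged name such as 1920s__1930s is not in the table, so the script would answer 404 instead of 200.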

Justin

FalseDawn

8:53 pm on Nov 26, 2005 (gmt 0)

10+ Year Member



I might be interpreting the OP's problem incorrectly, but I thought that the problem is that no 404 was being returned when one was expected?
All I'm saying is that if this is not the case, then the .htaccess rules are probably wrong - correct these, and a 404 should automatically be returned.
I'm not sure what relevance "returning blank pages" has in this situation.

jd01

10:00 pm on Nov 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The relevance of blank pages is:

If your site returns blank pages, I (or SEs requesting random pages, with similar patterns to real pages) can create multiple, duplicate content pages on your site for you...

A pattern matching .htaccess file is easy to get around, but extensive, exact rules are time consuming --- There is nothing wrong with the .htaccess rules posted in message 1 of this thread (except that [^.] would be more efficient than [^/] AND an L flag with the NC --- [NC,L] --- would help a little too).

Unless you have *very* defined patterns in your URLs the best place to correct this issue is in the php file, not the .htaccess. To stop this in the .htaccess, you basically have to have a rule for every page. Why? because if you use a pattern, it will match more than you would like...

EG if all of my pages are 3 letters (1 uppercase followed by 2 lowercase), to match them I would use the following:
RewriteRule ^[A-Z][a-z]{2}\.html$ /somestuff.html [L]

The above would match my pages (Dog, Cat, Log), but would also match any number of non-existent pages (Rrr, Www, Lrw).
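
The over-matching can be checked outside Apache. This is only a sketch: it runs the same regular expression through PHP's preg_match() instead of mod_rewrite, and the list of real pages and the classify function are hypothetical.

```php
<?php
// Same pattern as the RewriteRule above, tested with preg_match().
// The page list is hypothetical; only Dog, Cat and Log exist in this example.
$pattern = '/^[A-Z][a-z]{2}\.html$/';
$realPages = ['Dog.html', 'Cat.html', 'Log.html'];

function classify(string $path, string $pattern, array $realPages): string
{
    if (preg_match($pattern, $path) !== 1) {
        // The rule does not fire, so Apache handles the request normally.
        return 'no rule match';
    }
    return in_array($path, $realPages, true)
        ? 'match, real page'
        : 'match, but page does not exist';
}

foreach (['Dog.html', 'Rrr.html', 'dog.html'] as $path) {
    echo $path, ': ', classify($path, $pattern, $realPages), PHP_EOL;
}
```

Rrr.html passes the pattern even though no such page exists, so only the script behind the rule can tell the two cases apart.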

To actually correct the issue in the .htaccess you need to have a rule for every page... Much better to correct the issue in the php that should have been this way from the start.

Justin

hiker_jjw

2:19 am on Nov 27, 2005 (gmt 0)



I've seen similar problems before. Make sure you have "force session cookies" set to "true", otherwise you might be getting session IDs added to the end of your URLs. You will only notice the SIDs on the first page load, as the script tries to create the session.

Also, try to figure out what URLs have been indexed. This will usually give you a clearer understanding of where the error is occurring.

I wrote some other SEO tutorials (Advanced SEO URLs) which can be found at the contributions. Take a look, they might help your understanding.

jake66

2:57 am on Nov 27, 2005 (gmt 0)

10+ Year Member



To actually correct the issue in the .htaccess you need to have a rule for every page... Much better to correct the issue in the php that should have been this way from the start.

i don't know where to start for that. i'm very new to htaccess & php.
could anyone offer a suggestion or pointers as what i need to fix?

for your reference, i have zipped all of the files as they are on my website (in relative folders as well)

i have also included a zip of the original contribution zip i downloaded from the oscommerce website that rewrites these url's.

you can download it here: (this doesn't work with a www for some reason):
s37.yousendit.com/d.aspx?id=1BBQL2TRG9CLI2HIKGXNFDFE00
it's in .zip format
7.44 KB in total size

i loaded it to yousendit.com to prevent unnecessary calls to my server (i don't know how many of you or lurkers are interested in downloading the file, and this is a very large forum..)
i don't know whether or not i'm allowed to post a zip with php and an htaccess file or not, if i am not... please delete it and i will simply post the source of the files needed.

:)

jake66

9:00 am on Nov 28, 2005 (gmt 0)

10+ Year Member



not sure if this makes any difference, but i just renamed one of the most hit "fake" subcategories from 1920s to 1950s.
original URL:
/category/Old+Antiques/1920s.html

i hit reload on the offending URL....
all that showed were items relating to "1950s" instead of the 1920s stuff that originally showed.

once i renamed the subcategory back to 1920s & hit reload.... 1920s-related products were shown again.

FalseDawn

10:36 pm on Nov 28, 2005 (gmt 0)

10+ Year Member



"If your site returns blank pages, I (or SEs requesting random pages, with similar patterns to real pages) can create multiple, duplicate content pages on your site for you..."

Yes, I figured out what I was missing after I posted, and you are quite correct.

hiker_jjw

3:06 am on Nov 29, 2005 (gmt 0)



I took a look at your rewrite rules, and they look fine to me. It must be an oversight within the PHP code that is creating your unusual URLs. IMHO, the contribution you are using is not the best SEO method for osCommerce. Contact me thru stickymail and I'll point you to another method or two. Jeff