strange directories being crawled but do not exist

googlebot gains access to weird directories

         

classa

10:30 pm on Feb 22, 2006 (gmt 0)

10+ Year Member



I am seeing in my NetTracker that Googlebot is finding and crawling directories on my website that structurally look like this:

/index.php/links/blog/
/index.php/links/links/
/index.php/links/news/
/index.php/news/news/
/index.php/news/blog/
/index.php/news/links/

The funny thing is, these directories do not exist on my server, yet when I go to one of them in my browser, it shows pages from that directory minus graphics and minus CSS. What I am afraid of is a potential duplicate content penalty or, worse yet, being accused of having doorway pages. My robots.txt is not 100 miles long, blocking access to every twisted potential directory that does not even exist.

We do have URLs that look like mydomain.com/index.php?method=Some_Method

Why would Apache allow the #*$!ization of my URLs? I have been able to get the same page by removing the index.php and just requesting mydomain.com/?method=Some_Method

Any thoughts as to how this is happening and how we can fix it?

jdMorgan

5:43 am on Feb 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds like the code used to rewrite requests for static URLs to your script is buggy. Apache itself won't serve "made-up" requests for non-existent files.

If your config code is set up to rewrite *all* static URLs to your script, then the script must handle the missing-content case itself.

Jim

classa

4:10 pm on Feb 23, 2006 (gmt 0)

10+ Year Member



jdMorgan,

Thanks for the quick reply, but here is my .htaccess file below and I do not see where we may be getting into trouble with this one...

RewriteEngine On
RewriteBase /

#For non ssl ports
RewriteCond %{HTTP_HOST} !^www\.myurl\.com
RewriteCond %{SERVER_PORT} ^80
RewriteRule (.*) [myurl.com...] [R=301,L]

#For ssl ports
RewriteCond %{HTTP_HOST} !^www\.myurl\.com
RewriteCond %{SERVER_PORT} ^443
RewriteRule (.*) [myurl.com...] [R=301,L]

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog/index.php [L]
</IfModule>

<Files ~ "\.inc*">
Order allow,deny
Deny from all
</Files>

jdMorgan

4:39 pm on Feb 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This snippet will rewrite any request for a file or directory that does not exist as a 'real' file or directory to the /blog/index.php script.

RewriteEngine On
RewriteBase /blog/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog/index.php [L]

Because the scope appears to be limited to the /blog subdirectory, this may not explain the entire problem, but if the /blog/index.php script does not return a 404 or 410 status for "missing or undefined content" requested from /blog, then you'd see the kind of problem you're reporting in the /blog subdirectory.

---

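Another issue is the "form" of the URLs being requested. Apache maps a URL to the first part of the path that exists as a real file, here /index.php, and does not treat anything after it as part of the filename; the remainder is simply handed to the script as extra path information. That is, /index.php/apple and /index.php/fruit/apple will both resolve to the /index.php file. Relative links to graphics and CSS in the returned page are then resolved by the browser against the made-up "directory", which is likely why those pages appear without them.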
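The trailing part of such a URL is what Apache calls path info, and the core AcceptPathInfo directive controls whether it is accepted. As a rough sketch only, assuming Apache 2.0.30 or later and a host that permits FileInfo overrides in .htaccess (neither is confirmed in this thread), something like the following should make /index.php/<anything> return a 404 without any rewrite rules:

```apache
# Sketch only: assumes Apache 2.0.30+ and "AllowOverride FileInfo" for this directory.
# Reject requests that append extra path segments to index.php,
# so that e.g. /index.php/links/blog/ is answered with 404 Not Found.
# Note: <Files> matches by filename, so this also affects index.php
# in subdirectories such as /blog/.
<Files "index.php">
AcceptPathInfo Off
</Files>
```

This is only suitable if none of your scripts rely on URLs of the form /index.php/something; if any do, leave the setting alone and filter the unwanted requests selectively instead.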

You could detect and reject this type of URL, as long as you don't actually use any others like it:


# Return 410-Gone response for index.php requests with /<anything> appended (for HTTP/1.1 clients and later)
RewriteCond %{HTTP_HOST} .
RewriteRule ^index\.php/ - [G]
#
# Return 404-Not Found response for index.php requests with /<anything> appended (for HTTP/1.0 clients)
RewriteRule ^index\.php/ /file_that_does_not_exist [L]

Or, if this problem was caused by invalid links on pages on some site, you could 'correct' the request with a redirect, which tells the search engines to update their URL database:


# Redirect requests for index.php with /<anything> appended
RewriteRule ^index\.php/ http://www.example.com/index.php [R=301,L]

Jim