File does not exist error in Apache log from Search Engines Only

Search Engines generate file does not exist error

         

OahuRE

1:29 am on Feb 27, 2012 (gmt 0)

10+ Year Member



I am getting a "File does not exist" error message in the Apache log file, but when I copy the URL into my browser it always works. The URL is handled by the following RewriteRule.

RewriteRule ^MLSNUM([^/\.]+)P([^/\.]+)ADDRESS([^\.]+)?$ details.php?M1=$1&MyPropertyType=$2&OneProperty=Y&AllPhotos=Y [L]


What would cause the search engines to fail but every time I try it in the browser it works?

g1smd

1:51 am on Feb 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is their request including some encoded characters?

([^/\.]+)P
is dangerous as the character group [^/.]+ will initially "eat" the P. You should probably use ([^P/.]+)P here. Likewise for the A.
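A rough way to see the "eating" is to run just the character class against one of the request paths from the error log in this thread. This is only a sketch in Python's `re` module, not Apache itself; the assumption is that both engines treat these simple character classes and greedy `+` the same way.

```python
import re

# Request path copied from the error log in this thread.
path = "MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A"
rest = path[len("MLSNUM"):]  # what the first capture group gets to look at

# Original class: nothing stops it at the P, so it grabs the whole remainder
# and the engine has to back off character by character to find the P.
greedy = re.match(r"[^/.]+", rest).group()

# Class that also excludes P: it stops right before the first P.
bounded = re.match(r"[^P/.]+", rest).group()

print(greedy)   # the entire remainder of the path
print(bounded)  # only the digits before the P
```

The full rule still matches either way; the difference is how much work the engine does before it gets there.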

Why doesn't your script validate the $3 content? It leaves your site open to Duplicate Content and malicious linking issues.

OahuRE

2:47 am on Feb 27, 2012 (gmt 0)

10+ Year Member



I am an .htaccess novice, so it sounds like I have some serious issues. I will research and test what you are saying above, thanks.

Here is a look at some of the errors.

[Sun Feb 26 15:18:39 2012] [error] [client 180.76.6.232] File does not exist: /home/oahure/public_html/MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A
[Sun Feb 26 15:18:41 2012] [error] [client 180.76.5.145] File does not exist: /home/oahure/public_html/MLSNUM1008875P2ADDRESS-4280-Salt-Lake-Blvd-F


The URL in the browser would look like this.

[oahure.com...]

I just noticed that all the recent errors end with a dash and a unit identifier; for example, 4280-Salt-Lake-Blvd-F has the -F at the end. Not sure if this is significant, but I don't see any errors that look like 4280-Salt-Lake-Blvd.

lucy24

4:48 am on Feb 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The "File does not exist" can mean that the requested URL isn't getting picked up by your rewrite, or that the details.php file isn't where the server expects to find it, or the php isn't properly handling the information you send it. Apache can only deal with the first two possibilities.

Requested file:

/home/oahure/public_html/MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A

RewriteRule as written:

^MLSNUM([^/\.]+)P([^/\.]+)ADDRESS([^\.]+)?$ details.php?M1=$1&MyPropertyType=$2&OneProperty=Y&AllPhotos=Y

Let's try a walkthrough.

MLSNUM -- so far so good --
[^/\.]+ = 1009123P2ADDRESS-66-303-Haleiwa-Rd-A --so far so good --
P -- whoops! I will have to backtrack to find a P
{nanoseconds go by}
[^/\.]+ = 1009123 -- so far so good --
P -- so far so good --
[^/\.]+ = 2ADDRESS-66-303-Haleiwa-Rd-A --so far so good --
ADDRESS -- whoops again! more backtracking --
[^/\.]+ = 2 --so far so good --
ADDRESS -- so far so good --
([^\.]+) = -66-303-Haleiwa-Rd-A -- so far so good --
? = Oh. It wouldn't have made a difference if I didn't pick up anything here
$ = and that was all she wrote

Although it doesn't directly address (haha) your problem, two things jumped right out at me. The pieces between MLSNUM and P, and again the pieces between P and ADDRESS, are all numerals. Is it always like that? If so, you can save an enormous lot of server resources by constraining your search to [0-9]+ or (to taste) \d+

What happens with the -66-303-Haleiwa-Rd-A piece? It isn't going into the query string; in fact it doesn't have to exist at all. Is it just there for decoration? That brings you into Infinite URL Space territory.

Meanwhile, I think you need to look at that php file.

OahuRE

7:12 am on Feb 27, 2012 (gmt 0)

10+ Year Member



Yes, the pieces between are always all numerals. I will try your suggested change, thanks.

The address portion is strictly for SEO purposes, you are right, I do not need to pass it to the php page because with the numbers after the MLS and the P I already know everything about the property including the address.

I don't think it is the PHP file, as I never get an error that directly links to it and I changed most of my links to do that. Also, every URL from error messages like my examples in the Apache error log always works when I test it in the browser.

With my limited knowledge of .htaccess I don't follow why you say "P -- whoops! I will have to backtrack to find a P" and if there might be a better way to do that.

g1smd

7:55 am on Feb 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



[^/.]+ matches all characters to the very end of the input string, including the P, so the parser then has to try hundreds of "back off and retry" trial matches to get the right part. Use [0-9]+ instead.
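As a quick check that switching to [0-9]+ does not change what gets captured for a real listing, here is a small Python sketch (Python's `re` standing in for Apache's regex engine, which is an assumption of this example). The `made_up` path is hypothetical and only there to show what the tighter pattern rejects.

```python
import re

# The rule as originally posted, and the same rule with [0-9]+ for the two numeric parts.
ORIGINAL = re.compile(r"^MLSNUM([^/\.]+)P([^/\.]+)ADDRESS([^\.]+)?$")
NUMERIC = re.compile(r"^MLSNUM([0-9]+)P([0-9]+)ADDRESS([^\.]+)?$")

logged = "MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A"  # from the error log
made_up = "MLSNUMabcPxyzADDRESS-foo"                    # hypothetical junk request

original_groups = ORIGINAL.match(logged).groups()
numeric_groups = NUMERIC.match(logged).groups()

print(original_groups == numeric_groups)     # same three captures for the real URL
print(ORIGINAL.match(made_up) is not None)   # loose pattern accepts non-numeric IDs
print(NUMERIC.match(made_up) is None)        # tight pattern rejects them
```

With digits only, the first group stops at the P on its own, so there is nothing to back off from.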

"I do not need to pass it to the php page because with the numbers after the MLS and the P..."

If you fail to pass $3 to your script and validate it, people can link to your site like this
example.com/MLSNUM1009123P2ADDRESS-extensive-fire-damage-with-a-crack-den-next-door 
and your site will return "200 OK" meaning it is a valid page.
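To make that concrete, here is a sketch of what the rule does with two different address slugs. It is Python imitating the RewriteRule as posted, not Apache itself, and the `rewrite` helper is a made-up name for this example.

```python
import re

# The RewriteRule pattern as originally posted in this thread.
RULE = re.compile(r"^MLSNUM([^/\.]+)P([^/\.]+)ADDRESS([^\.]+)?$")

def rewrite(path):
    """Imitate the rule's substitution; return None when the pattern does not match."""
    m = RULE.match(path)
    if m is None:
        return None
    # $3 (the address slug) is captured but never used in the target.
    return (f"details.php?M1={m.group(1)}&MyPropertyType={m.group(2)}"
            "&OneProperty=Y&AllPhotos=Y")

real = rewrite("MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A")
fake = rewrite("MLSNUM1009123P2ADDRESS-extensive-fire-damage-with-a-crack-den-next-door")

print(real)
print(real == fake)  # both URLs end up at the identical script request
```

Since the script never sees the slug, it has no way to tell the two requests apart.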

lucy24

10:11 am on Feb 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, wait, let me make sure we're talking the same language. When you say "put the url into the browser" do you mean the original url-- the one that is about to be rewritten-- or do you mean the target url-- the one you get after the rewrite?

There may be further complications.

Or do you mean that error logs say such-and-such does not exist, and then when you paste "such-and-such" into the address bar, it's right there? If so it may be a pretty straightforward matter of getting your RewriteBase garbled.

OahuRE

4:45 pm on Feb 27, 2012 (gmt 0)

10+ Year Member



When I put the target URL into the browser it always works. This is the one that has been rewritten.

Also, I never had an error from the direct links that are not rewritten to my page, which is why I do not feel there is a php problem with the page.

OahuRE

5:03 pm on Feb 27, 2012 (gmt 0)

10+ Year Member



OK, I am now using [0-9]+ and that works great. I am not sure if it will fix my issue, but it is an improvement.

Regarding validating $3, I could try to compare the address entered with the one for that property, but that could bring about some problems because if there was any small difference then a valid page would not load.

I understand no one wants a link "with-a-crack-den-next-door" tied to their property, but the link would need to show up high in the search engines and it does not seem like it could do anything malicious to my Website. So while validating it has the risk of shutting down legitimate pages, not validating it does not seem to have a huge risk unless I am not understanding how something malicious could be done to my site.

g1smd

6:51 pm on Feb 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"...there was any small difference then a valid page would not load"

One of the first things your script should do is compare the $3 text with what is in the database for the current request. If there is no match the PHP script should send a 301 redirect to the correct URL.
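A minimal sketch of that check, written in Python only so it can be run standalone; the real thing would live in the PHP script. The `CANONICAL_SLUGS` dictionary is a hypothetical stand-in for the property database, `check_request` is a made-up name, and the single entry is taken from a URL in the error log purely for illustration.

```python
import re

RULE = re.compile(r"^MLSNUM([0-9]+)P([0-9]+)ADDRESS([^\.]+)?$")

# Hypothetical stand-in for the database: (MLS number, property type) -> address slug.
CANONICAL_SLUGS = {("1009123", "2"): "-66-303-Haleiwa-Rd-A"}

def check_request(path):
    """Return (status, redirect_location) for a requested path."""
    m = RULE.match(path)
    if m is None:
        return 404, None
    mls, ptype, slug = m.groups()
    canonical = CANONICAL_SLUGS.get((mls, ptype))
    if canonical is None:
        return 404, None  # no such property
    if slug != canonical:
        # Wrong or missing slug: send the visitor to the one correct URL.
        return 301, f"/MLSNUM{mls}P{ptype}ADDRESS{canonical}"
    return 200, None

print(check_request("MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A"))
print(check_request("MLSNUM1009123P2ADDRESS-crack-den-next-door"))
```

Because a mismatch redirects instead of failing, a small difference in the address never stops a valid page from loading; it just lands on the canonical URL.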

lucy24

1:14 am on Feb 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I want to go back to something because we got sidetracked.
"What would cause the search engines to fail but every time I try it in the browser it works?"

Do you mean like this?

-- Access logs report request for such-and-such file, with search engine as referer.
-- Error logs for the same request say some-other-file does not exist.
-- The filename in your access logs is the filename that comes up in a search-engine result.
-- The filename in your error logs is the filename that you get after rewriting that request.

I just detoured to double-check what happens when you rewrite to a nonexistent page. I first renamed my "I don't like your face" page and then went to my site in MSIE 5, which normally gets rewritten to this page. What I got, physically, was my custom 404 page. Access logs name the page I was aiming at, with-- oops! didn't expect this!-- a status of 200, and then the error directory's stylesheet. Error logs name the "I don't like your face" page along with "does not exist".

Well, that's unnerving. Why am I getting a 200 for a page that doesn't exist?

:: lightbulb ::

It's got something to do, in some way, with something seemingly unrelated I asked about ages ago: if you rewrite an image, you never get back a 304. Always a fresh 200, even with back-to-back requests.

Sorry. Drifting a bit afield from your question. But out of curiosity, do the access logs that match your "file does not exist" errors show a 200 or a 404?

"Regarding validating $3 I could try to compare the address entered with the one for that property, but that could bring about some problems because if there was any small difference then a valid page would not load."

Do something simple with a database at the same time that you first generate the code for the address. So $1 $2 and $3 are all stored in the same place and all you have to do is check them against each other. You probably just need to tweak some existing formatting, since obviously you know where Address #123456 is even if you don't store the information under "123 Main Street".

If there's a mismatch, bounce the visitor back to the page that displays the properties, or whatever it is you do when people come in "cold". Or send them to a custom error message that says "We goofed!" That always sounds nicer than "You deliberately tried to mess up our database, didn't you?"

OahuRE

4:19 am on Feb 28, 2012 (gmt 0)

10+ Year Member



Thanks, I have not looked at the access logs, just the error log. I now get the following errors in the error log:

[Mon Feb 27 15:36:58 2012] [error] [client 199.21.99.65] File does not exist: /home/oahure/public_html/MLSNUM1110489P1 target=
[Mon Feb 27 15:37:41 2012] [error] [client 199.21.99.65] File does not exist: /home/oahure/public_html/MLSNUM1108561P1 target=
[Mon Feb 27 15:38:10 2012] [error] [client 199.21.99.65] File does not exist: /home/oahure/public_html/MLSNUM1103212P1 target=
[Mon Feb 27 15:39:34 2012] [error] [client 199.21.99.65] File does not exist: /home/oahure/public_html/MLSNUM1105017P1 target=
[Mon Feb 27 15:43:28 2012] [error] [client 199.21.99.65] File does not exist: /home/oahure/public_html/MLSNUM1110466P1 target=

I took a peek at the access log and it is mostly all of the following, not really much interesting.

127.0.0.1 - - [27/Feb/2012:17:49:09 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:20 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:22 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:30 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:31 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:32 -1000] "OPTIONS * HTTP/1.0" 200 -
127.0.0.1 - - [27/Feb/2012:17:49:33 -1000] "OPTIONS * HTTP/1.0" 200 -

I assume the new error message in the log was caused by the change I made to the htaccess file.

The line now looks like this:

RewriteRule ^MLSNUM([0-9]+)P([0-9]+)ADDRESS([^\.]+)?$ details.php?M1=$1&MyPropertyType=$2&OneProperty=Y&AllPhotos=Y [L]

I have no idea where it is getting the target= stuff. If you drop off the target= the first part of the file name works fine in the browser.

lucy24

6:30 am on Feb 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:!: I know where it's coming from. It's Google's own fevered imagination. Somewhere there's a link that says, in part, "target = '_blank'". And this is getting garbled into the URL. Similar things show up all the time in GWT. There are whole threads about it.
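As for why those requests land in the error log: a garbled path like that has no ADDRESS part, so the rule as currently posted cannot match it, and (assuming no other rule picks it up) Apache falls back to looking for a real file of that name. A quick sketch with Python's `re` standing in for Apache's regex engine:

```python
import re

# The rule as it stands after the [0-9]+ change.
RULE = re.compile(r"^MLSNUM([0-9]+)P([0-9]+)ADDRESS([^\.]+)?$")

garbled = "MLSNUM1110489P1 target="                    # path from the new error-log lines
normal = "MLSNUM1009123P2ADDRESS-66-303-Haleiwa-Rd-A"  # path from the earlier log lines

print(RULE.match(garbled))              # None: the literal ADDRESS is missing
print(RULE.match(normal) is not None)   # the usual URL shape still matches
```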

Met a robot in January that was so godawfully stupid, if it saw anything in an anchor, it thought it was an external link. In fact it's enshrined in my records as "Stupid Robot":
<a class = 'external'
<a name = 'chapter'
<a rel = 'nofollow'
<a href = '#fragment'
and so on all got interpreted as <a href = "gohere">

OahuRE

6:38 pm on Feb 28, 2012 (gmt 0)

10+ Year Member



OK, thanks, I will search out that link and change it. I would like to get rid of all the Googlebot errors if possible as they fill up my error_log file fast.