jetteroheller

msg:4405504 | 9:50 am on Jan 10, 2012 (gmt 0) |
Some more information: the attack started in July 2011. I only found it through Google Webmaster Tools because of a network outage on December 22nd: Google Webmaster Tools listed 162 URLs as not reachable because of the outage.
The problem: Apache returns something.htm, regardless of what garbage follows in something.htm/garbage
The countermeasure:
RewriteRule ^(.*).htm/(.*)$ http://example.com/$1.htm [R=301,L]
This cuts away everything after .htm/
|
lucy24

msg:4405516 | 11:28 am on Jan 10, 2012 (gmt 0) |
Urk. If you can't make the server cut it out-- it sounds like mod_negotiation running amok-- I guess you are stuck with a rewrite. But it will be easier on your server if you say
RewriteRule ([^.]+\.htm). http://example.com/$1 [R=301,L]
(The dot after the parentheses is not a flyspeck.) Nothing before the permitted .htm will contain a period, so that is the simplest way to constrain the search and still have the function stop in the right place. Otherwise it will have to keep backtracking until it finds the .htm in the middle. No need for an opening anchor, because the search starts at the beginning by default and you're capturing everything you see. May as well include the .htm in the capture, since you're there anyway. Saves typing. On the other hand you don't need to capture the part after .htm; you just need to ensure that there is at least one more character. Incidentally that will be enough to redirect any stray .html that sneaked in. Now go see if you can talk sense into your server ;)
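As a rough illustration of the rule above, here it is again with each piece of the pattern annotated. This is only a sketch: it assumes mod_rewrite is enabled and that the rule sits in the .htaccess file at the site root, neither of which is stated in the thread.

```apache
# Assumption: root .htaccess, mod_rewrite available.
RewriteEngine On

# ([^.]+\.htm)  one or more characters that are not a dot, then a literal ".htm";
#               captured as $1, e.g. "page.htm" or "folder/page.htm"
# .             at least one more character after ".htm" (a slash, an "l", anything)
# A plain request for page.htm has nothing after ".htm", so it does not match
# and the redirect cannot loop.
RewriteRule ([^.]+\.htm). http://example.com/$1 [R=301,L]
```

With this in place a request for /page.htm/folder-1/folder-5 would be redirected to http://example.com/page.htm, while /page.htm itself is served as usual.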
|
jetteroheller

msg:4405521 | 12:49 pm on Jan 10, 2012 (gmt 0) |
| it sounds like mod_negotiation running amok |
| The strange URLs are only accessed by crawl......googlebot.com. Only one URL has this problem. The strange URL contains only folder names that really exist on my server.
Folders on the site: / /folder-1 /folder-2 /folder-3 /folder-4 /folder-5 /folder-6
The access mixes them together:
http://example.com/page.htm/folder-1/folder-5/folder-3/folder-6/folder-2/folder-4
|
enigma1

msg:4405536 | 1:42 pm on Jan 10, 2012 (gmt 0) |
| The strange URL contains only folder names really existing on my server |
| Looks like you are having problems with your website code and you should fix it. If these were physical static pages you wouldn't care about it and googlebot wouldn't even see the mangled links. Since you see them listed it implies your code is prone to url poisoning. Somehow these links once accessed are replicated inside your various pages and anyone could theoretically create an infinite number of duplicated pages for your domain. It's not a problem with apache and the htaccess rules can't fix this problem, you're just hiding the culprit in the particular case.
|
jetteroheller

msg:4405566 | 2:52 pm on Jan 10, 2012 (gmt 0) |
| Looks like you are having problems with your website code and you should fix it. If these were physical static pages you wouldn't care about it and googlebot wouldn't even see the mangled links. Since you see them listed it implies your code is prone to url poisoning. |
| All pages are static. All links are static. All content is produced with a CMS that verifies all links. So I have no idea what you mean.
|
enigma1

msg:4405574 | 3:25 pm on Jan 10, 2012 (gmt 0) |
So it's not static, you are using a CMS. Physical static pages is when you have actual files behind a request. In your case the CMS handles the friendly link requests and sets up the environment so the rest of your CMS code can work. So someone feeds your CMS with the wrong link like the one you posted above. You need to figure out why the code takes whatever they pass through and propagates it. At least for some cases. It shouldn't.
|
jetteroheller

msg:4405591 | 4:49 pm on Jan 10, 2012 (gmt 0) |
| So it's not static, you are using a CMS. Physical static pages is when you have actual files behind a request. In your case the CMS handles the friendly link requests and sets up the environment so the rest of your CMS code can work. |
| No. The CMS is only on my notebook to create all the pages. The pages on the server are all static.
|
enigma1

msg:4405656 | 7:28 pm on Jan 10, 2012 (gmt 0) |
Ok, so are you saying you check your Google WMT and you see errors with duplicate content? Or do you just see accesses in your server log? The server may respond to requests for its internal folders depending on how your environment is mapped. You should be able to change it from your cPanel with directory aliases. For example, I could set up the cgi-bin folder to be mapped somewhere in the main domain's web space so it will just be an invalid page if someone requests it. See how these are mapped on your server. Now if someone requests a page by forcing some garbage into the URL, the thing is to process only the part you know about.
example.com/page.htm?test=1
example.com/page.htm/test/test/
example.com/page.htm#test
The server will return 200 and that's ok. It will be a problem if you see the "test" links being replicated somewhere in your HTML source, or if in some way you trust such requests in your code.
|
g1smd

msg:4405665 | 7:50 pm on Jan 10, 2012 (gmt 0) |
You should ensure that AcceptPathInfo is Off. You should not be redirecting the duff URL requests. Doing that creates infinite URL space on your site. You should return 410 Gone or similar. Be aware that RegEx patterns with a leading (.*) sub-pattern or multiple (.*) sub-patterns are ambiguous and very very inefficient. You should use something else.
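A minimal sketch of the first suggestion, assuming the directive is permitted in your .htaccess (it needs AllowOverride FileInfo) and that whatever handler serves your .htm files honours it:

```apache
# With trailing path info rejected, /page.htm/garbage no longer maps to page.htm
# and Apache answers 404 Not Found instead of 200 OK.
AcceptPathInfo Off
```

If the setting has no effect on your host, a RewriteRule with the [G] flag can return 410 Gone for the same requests instead.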
|
SteveWh

msg:4405814 | 8:37 am on Jan 11, 2012 (gmt 0) |
| But the standard Apache server returns 200 OK for http://example.com/page.htm/garbage/more-garbage/even-more-garbage/big-garbage-collection/only-garbage |
| Yes, it does, and serves the requested page that ends at .htm, ignoring the trailing garbage. I've been hit by this, by a bot that says it's Baidu spider. Its IP traces to Baidu, but it is very badly behaved, doesn't seem to be working in a coordinated way with the other legitimate Baidu crawlers, and it consumes gigabytes of bandwidth. Like yours, mine mixes and matches directories that really do exist, in bizarre random combinations. I've thought this might be either a hacked crawling computer at the company, or one whose programming has gone horribly wrong, but there might be other possible explanations. It's been going on since at least early October. My pages are static, too. Does the IP you see actually trace to Google, Inc.? I believe your countermeasure would more properly be:
RewriteRule ^(.+?\.htm)/.+$ http://example.com/$1 [R=301,L]
but that's not what I use, and I agree with g1smd that a redirect in this case isn't such a good idea. I just ban the garbage requests, which also cuts way down on the bandwidth used:
RewriteCond %{REQUEST_URI} ^/ThePageThatIsBeingHit\.htm/.+ [NC]
RewriteRule .* - [F]
|
jetteroheller

msg:4405836 | 10:17 am on Jan 11, 2012 (gmt 0) |
Added yesterday to robots.txt:
Disallow: /page.htm/
|
MikeNoLastName

msg:4406026 | 10:22 pm on Jan 11, 2012 (gmt 0) |
Thanks for this thread, I was looking for exactly this code. A while back, we had a similar issue where apparently someone linked us incorrectly and ended up putting a period after the url resulting in www.example.com/abc.htm. and I believe also www.example.com/abc.htm/ both of which resolve apparently under apache, both were being indexed and both were causing definite duplication penalties. I fixed each incident with 301's when I found them (they usually come up in the supplemental results), but this is way better.
|
jetteroheller

msg:4406841 | 5:44 pm on Jan 14, 2012 (gmt 0) |
Today I noticed:
1.) extremely low traffic on my site
2.) site:my-site.com shows 19,100 instead of the usual 10,100 pages indexed
3.) site:hit.my-site.com shows 10,000 instead of the usual 2,000 pages indexed
I just looked at the situation for hit.my-site.com in Google Webmaster Tools: crawl errors - blocked by robots.txt (100,000).
I just tried in Google Webmaster Tools to remove hit.my-site.com/file.htm/*
|
peego

msg:4406850 | 6:23 pm on Jan 14, 2012 (gmt 0) |
I also have one static htm page (the entire site is 100% static pages) with similar issue that I found in WMT.
It looks something like this:
http://www.mysite.com/folder/page.htm/some-garbage
I was wondering, can I simply return a 410 status for that? For example, would it be ok if I used a line in my htaccess like so:
redirect 410 /folder/page.htm/some-garbage
Would that be ok?
|
g1smd

msg:4406860 | 7:11 pm on Jan 14, 2012 (gmt 0) |
Simplify:
RewriteRule ^folder/page\.htm. - [G]
Returns 410 for requests with anything at all (in the path) after the .htm part.
RewriteRule ^folder/page\.htm/ - [G]
Returns 410 for requests with .htm and a slash and then anything or nothing (in the path) after that.
The rule must go before your redirecting rules. Additionally, as soon as you have one rule in your htaccess file using RewriteRule do not use Redirect or RedirectMatch for any of your other rules.
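To make the ordering concrete, a sketch of how such a file might be laid out. The hostname redirect is a hypothetical example of a "redirecting rule" and is not taken from anyone's actual configuration; www.example.com is a placeholder.

```apache
RewriteEngine On

# 1. Reject the junk first: anything under /folder/page.htm/ gets 410 Gone.
#    [G] stops processing, so the redirect below never sees these requests.
RewriteRule ^folder/page\.htm/ - [G]

# 2. Only then the redirecting rules.
#    Hypothetical example: send every other hostname to www.example.com.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

If the order were reversed, a junk request on the wrong hostname would first be answered with a 301 and only get its 410 on the second request. Everything is written with RewriteRule, so no Redirect or RedirectMatch lines are mixed in.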
|
jetteroheller

msg:4406884 | 9:06 pm on Jan 14, 2012 (gmt 0) |
Just checked in Google Webmaster Tools: the removal request was successful, but site:my-site.com still shows far too many URLs. It seems to take a little more time.
|
lucy24

msg:4406918 | 11:17 pm on Jan 14, 2012 (gmt 0) |
Google does not clean house overnight. The Crawl Errors page alone can go back months. What matters is that the current crawls are getting the correct responses: 200 if it exists, 404 or 410 if it doesn't. You can also go into the "remove from index" part of gwt and have whole directories removed from the existing index. But be careful not to remove the real page at the same time.
|
jetteroheller

msg:4406967 | 8:17 am on Jan 15, 2012 (gmt 0) |
Despite the message remove request successful, site:hit.my-site.com increased to 13600
|
enigma1

msg:4407262 | 12:26 pm on Jan 16, 2012 (gmt 0) |
You may want to get rid of the redirects unless you validate the target page is valid. And check in the server error log if the invalid pages you see googlebot accessing are recorded.
|
jetteroheller

msg:4409776 | 5:34 am on Jan 23, 2012 (gmt 0) |
I am checking site:hit.my-site.com and site:my-site.com every day. Now it is starting to decline: site:hit.my-site.com was over 20,000 for several days and is now at 12,200.
|
|