WebmasterWorld: Google SEO News and Discussion Forum

    
The example.com/page.htm/garbage/ attack
jetteroheller

Msg#: 4405474 posted 6:21 am on Jan 10, 2012 (gmt 0)

Since 2007, my .htaccess has enforced strictly canonical URLs.

www is redirected to non-www
/index.htm is redirected to /

In other words, everything needed so that there is only one version of each page.
I even cut off the ? and everything after it.

So

http://www.example.com/?here-comes-a-long-long-string
is redirected to
http://www.example.com/
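For readers following along, canonical rules of this kind might look roughly like the sketch below. This is not the poster's actual .htaccess, which is not quoted in the thread. It assumes mod_rewrite is enabled, that example.com is the canonical host, and that index.htm is the directory index.

```apache
RewriteEngine On

# /index.htm -> / (only when the client literally asked for index.htm)
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.htm[?\s]
RewriteRule ^(.*/)?index\.htm$ http://example.com/$1 [R=301,L]

# Strip any query string: the trailing "?" in the target discards it
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.*)$ http://example.com/$1? [R=301,L]

# www -> non-www
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The THE_REQUEST condition limits the index.htm rule to what the client actually requested, so the server's own internal lookup of the directory index should not trigger a redirect loop.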

But there is a new form of attack that needs a new countermeasure.

There is a page

http://example.com/page.htm

But the standard Apache server returns 200 OK for

http://example.com/page.htm/garbage/more-garbage/even-more-garbage/big-garbage-collection/only-garbage

From January 1st until today,
there have been 42,017 Googlebot visits to URLs like this.

I am now adding a rule to .htaccess that cuts off everything after .htm/ and redirects to the .htm URL.

 

jetteroheller

Msg#: 4405474 posted 9:50 am on Jan 10, 2012 (gmt 0)

Some more information:

The attack started in July 2011.

I only found it through Google Webmaster Tools, because
of a network outage on December 22nd.

Google Webmaster Tools listed 162 URLs as not reachable because of
the network outage.

The problem:

Apache returns something.htm regardless of what garbage follows it, as in something.htm/garbage

The countermeasure:

RewriteRule ^(.*).htm/(.*)$ http://example.com/$1.htm [R=301,L]

This cuts away everything after .htm/

lucy24

Msg#: 4405474 posted 11:28 am on Jan 10, 2012 (gmt 0)

Urk. If you can't make the server cut it out-- it sounds like mod_negotiation running amok-- I guess you are stuck with a rewrite. But it will be easier on your server if you say

RewriteRule ([^.]+\.htm). http://example.com/$1 [R=301,L]

(The dot after the parentheses is not a flyspeck.)

Nothing before the permitted .htm will contain a period, so that is the simplest way to constrain the search and still have the function stop in the right place. Otherwise it will have to keep backtracking until it finds the .htm in the middle.

No need for an opening anchor, because the search starts at the beginning by default and you're capturing everything you see. May as well include the .htm in the capture, since you're there anyway. Saves typing.

On the other hand you don't need to capture the part after .htm; you just need to ensure that there is at least one more character. Incidentally that will be enough to redirect any stray .html that sneaked in.
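The pattern described above can be read piece by piece. The sketch below is the same rule with comments added and nothing else changed; example.com stands in for the real host, as in the rest of the thread.

```apache
RewriteEngine On

# ([^.]+\.htm)  capture: a run of characters that are not a period, then the literal ".htm"
# .             then at least one more character of any kind ("/", "l", and so on)
# The capture becomes $1, so whatever follows ".htm" is dropped from the redirect target.
RewriteRule ([^.]+\.htm). http://example.com/$1 [R=301,L]
```

This relies on the stated assumption that nothing before .htm contains a period. If a directory name did contain one, the unanchored pattern could start matching part-way through the path (a hypothetical v1.2/page.htm/x would capture 2/page.htm), so a leading ^ would be a reasonable safeguard in that case.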

Now go see if you can talk sense into your server ;)

jetteroheller

Msg#: 4405474 posted 12:49 pm on Jan 10, 2012 (gmt 0)

"it sounds like mod_negotiation running amok"


The strange URLs are only requested by crawl......googlebot.com
Only one URL has this problem.

The strange URL contains only folder names really existing on my server

folders on the site:

/
/folder-1
/folder-2
/folder-3
/folder-4
/folder-5
/folder-6

and the requests mix them together:

http://example.com/page.htm/folder-1/folder-5/folder-3/folder-6/folder-2/folder-4

enigma1

Msg#: 4405474 posted 1:42 pm on Jan 10, 2012 (gmt 0)

"The strange URL contains only folder names really existing on my server"

Looks like you are having problems with your website code and you should fix it. If these were physical static pages you wouldn't care about it and googlebot wouldn't even see the mangled links. Since you see them listed it implies your code is prone to url poisoning.

Somehow these links once accessed are replicated inside your various pages and anyone could theoretically create an infinite number of duplicated pages for your domain.

It's not a problem with apache and the htaccess rules can't fix this problem, you're just hiding the culprit in the particular case.

jetteroheller

Msg#: 4405474 posted 2:52 pm on Jan 10, 2012 (gmt 0)

"Looks like you are having problems with your website code and you should fix it. If these were physical static pages you wouldn't care about it and googlebot wouldn't even see the mangled links. Since you see them listed it implies your code is prone to url poisoning."


The whole page is static.
All links are static.
All content is produced with a CMS that verifies all links.

So I have no idea what you mean.

enigma1

Msg#: 4405474 posted 3:25 pm on Jan 10, 2012 (gmt 0)

So it's not static, you are using a CMS. Physical static pages are when you have actual files behind a request. In your case the CMS handles the friendly link requests and sets up the environment so the rest of your CMS code can work.

So someone feeds your CMS with the wrong link like the one you posted above. You need to figure out why the code takes whatever they pass through and propagates it. At least for some cases. It shouldn't.

jetteroheller

Msg#: 4405474 posted 4:49 pm on Jan 10, 2012 (gmt 0)

"So it's not static, you are using a CMS. Physical static pages are when you have actual files behind a request. In your case the CMS handles the friendly link requests and sets up the environment so the rest of your CMS code can work."


No. The CMS is only on my notebook to create all the pages.
The pages on the server are all static.

enigma1

Msg#: 4405474 posted 7:28 pm on Jan 10, 2012 (gmt 0)

OK, so are you saying you checked Google WMT and saw errors about duplicate content? Or do you just see the requests in your server log?

The server may respond to requests for its internal folders depending on how your environment is mapped. You should be able to change it from your cpanel with directory aliases. For example I could set up the cgi-bin folder to be mapped somewhere in the main domain's web space so it will just be an invalid page if someone requests it. See how these are mapped on your server.

Now, if someone requests a page with some garbage forced into the URL, the thing to do is process only the part you know about.

example.com/page.htm?test=1
example.com/page.htm/test/test/
example.com/page.htm#test

The server will return 200 and it's ok. It will be a problem if you see in your html source somewhere the "test" links being replicated, or in some way you trust such requests in your code.

g1smd

Msg#: 4405474 posted 7:50 pm on Jan 10, 2012 (gmt 0)

You should ensure that AcceptPathInfo is Off.

You should not be redirecting the duff URL requests. Doing that creates infinite URL space on your site. You should return 410 Gone or similar.

Be aware that RegEx patterns with a leading (.*) sub-pattern or multiple (.*) sub-patterns are ambiguous and very very inefficient. You should use something else.
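As a rough illustration of the two suggestions above, a minimal .htaccess sketch might look like this. It assumes the host permits the directive in .htaccess (AcceptPathInfo needs AllowOverride FileInfo) and that mod_rewrite is available; the second rule is one possible form of "410 Gone or similar", not necessarily what the poster had in mind.

```apache
# Refuse requests that carry extra path segments after a real file name.
# With this set, /page.htm/garbage/ should get 404 Not Found instead of page.htm.
AcceptPathInfo Off

RewriteEngine On
# Alternatively, or in addition: answer 410 Gone for any path containing ".htm/".
# "-" means no substitution; [G] sets the Gone status.
RewriteRule \.htm/ - [G]
```

Either response ends the request without pointing the crawler at another URL, unlike a redirect.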

SteveWh

Msg#: 4405474 posted 8:37 am on Jan 11, 2012 (gmt 0)

"But the standard Apache server returns 200 OK for
http://example.com/page.htm/garbage/more-garbage/even-more-garbage/big-garbage-collection/only-garbage"

Yes, it does, and serves the requested page that ends at .htm, ignoring the trailing garbage.

I've been hit by this, by a bot that says it's Baidu spider. Its IP traces to Baidu, but it is very badly behaved, doesn't seem to be working in a coordinated way with the other legitimate Baidu crawlers, and it consumes gigabytes of bandwidth. Like yours, mine mixes and matches directories that really do exist, in bizarre random combinations. I've thought this might be either a hacked crawling computer at the company, or one whose programming has gone horribly wrong, but there might be other possible explanations. It's been going on since at least early October.

My pages are static, too.

Does the IP you see actually trace to Google, Inc.?

I believe your countermeasure would more properly be:

RewriteRule ^(.+?\.htm)/.+$ http://example.com/$1 [R=301,L]

but that's not what I use, and I agree with g1smd that a redirect in this case isn't such a good idea. I just ban the garbage requests, which also cuts way down on the bandwidth used:

RewriteCond %{REQUEST_URI} ^/ThePageThatIsBeingHit\.htm/.+ [NC]
RewriteRule .* - [F]

jetteroheller

Msg#: 4405474 posted 10:17 am on Jan 11, 2012 (gmt 0)

Yesterday I added this to robots.txt:

Disallow: /page.htm/

MikeNoLastName

Msg#: 4405474 posted 10:22 pm on Jan 11, 2012 (gmt 0)

Thanks for this thread; I was looking for exactly this code. A while back we had a similar issue: apparently someone linked to us incorrectly and put a period after the URL, resulting in www.example.com/abc.htm. and, I believe, also www.example.com/abc.htm/. Both apparently resolve under Apache, both were being indexed, and both were causing definite duplication penalties. I fixed each incident with 301s when I found them (they usually come up in the supplemental results), but this is way better.

jetteroheller

Msg#: 4405474 posted 5:44 pm on Jan 14, 2012 (gmt 0)

Today I noticed:

1.) extremely low traffic on my site
2.) site:my-site.com shows 19,100 pages indexed instead of the usual 10,100
3.) site:hit.my-site.com shows 10,000 pages indexed instead of the usual 2,000

I just looked at the situation in Google Webmaster Tools:

hit.my-site.com

crawl errors - blocked by robots.txt (100,000)

I have just tried in Google Webmaster Tools
to remove hit.my-site.com/file.htm/*

peego

Msg#: 4405474 posted 6:23 pm on Jan 14, 2012 (gmt 0)
I also have one static htm page (the entire site is 100% static pages) with a similar issue that I found in WMT.

It looks something like this:

http://www.mysite.com/folder/page.htm/some-garbage

I was wondering, can I simply return a 410 status for that? For example, would it be ok if I used a line in my htaccess like so:

redirect 410 /folder/page.htm/some-garbage

Would that be ok?
g1smd

Msg#: 4405474 posted 7:11 pm on Jan 14, 2012 (gmt 0)

Simplify:

RewriteRule ^folder/page\.htm. - [G]

Returns 410 for requests with anything at all (in the path) after the .htm part.

RewriteRule ^folder/page\.htm/ - [G]

Returns 410 for requests with .htm and a slash and then anything or nothing (in the path) after that.

The rule must go before your redirecting rules.

Additionally, as soon as you have one rule in your htaccess file using RewriteRule do not use Redirect or RedirectMatch for any of your other rules.
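Putting that ordering advice together, a sketch of the layout might look like the following. The old.htm and new.htm names are hypothetical placeholders, example.com stands in for the real host, and the mod_alias line is shown only as a comment to illustrate what is being replaced.

```apache
RewriteEngine On

# 1. Blocking rules first: anything at all after "folder/page.htm" gets 410 Gone.
RewriteRule ^folder/page\.htm. - [G]

# 2. Redirects afterwards, all written with RewriteRule.
#    Instead of the mod_alias form:
#      Redirect 301 /old.htm http://example.com/new.htm
#    use the mod_rewrite equivalent:
RewriteRule ^old\.htm$ http://example.com/new.htm [R=301,L]
```

The usual reasoning is that mod_alias and mod_rewrite run at different stages of request processing, so when both are used the order of lines in the file no longer tells you which rule acts first.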

jetteroheller

Msg#: 4405474 posted 9:06 pm on Jan 14, 2012 (gmt 0)

Just checked in Google Webmaster Tools: the removal request was successful,
but site:my-site.com still shows far too many URLs. It seems to take a little more time.

lucy24

Msg#: 4405474 posted 11:17 pm on Jan 14, 2012 (gmt 0)

Google does not clean house overnight. The Crawl Errors page alone can go back months. What matters is that the current crawls are getting the correct responses: 200 if it exists, 404 or 410 if it doesn't.

You can also go into the "remove from index" part of gwt and have whole directories removed from the existing index. But be careful not to remove the real page at the same time.

jetteroheller

Msg#: 4405474 posted 8:17 am on Jan 15, 2012 (gmt 0)

Despite the message that the removal request was successful,

site:hit.my-site.com increased to 13600

enigma1

Msg#: 4405474 posted 12:26 pm on Jan 16, 2012 (gmt 0)

You may want to get rid of the redirects unless you verify that the target page is valid. Also check the server error log to see whether the invalid pages you see googlebot accessing are recorded.

jetteroheller

Msg#: 4405474 posted 5:34 am on Jan 23, 2012 (gmt 0)

I am checking site:hit.my-site.com and site:my-site.com every day.

Now it is starting to decline.

site:hit.my-site.com was over 20,000 for several days and is now at 12,200.
