redirect .htmlAll to .html

         

LilyTousi

2:12 pm on Jan 25, 2016 (gmt 0)

10+ Year Member



Hi,
Hope you can help me with this matter.
Recently I have been flooded with bad referrers from hotels-in.xyz that generate 404 errors in my Google Webmaster Tools.
What the system does: it takes a valid html file name, but adds the first word of the meta title tag to the end of ".html".
For example, abc.html shows up as abc.htmlAll or abc.htmlDiscover, etc.
How do I code it in .htaccess to prevent this?
Thanks
Lily

whitespace

2:45 pm on Jan 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



You could do something like the following in your root .htaccess file:


RewriteEngine On
RewriteRule ^([^.]+\.html)\w+$ /$1 [R=301,L]


This basically removes any run of word characters that occurs after ".html" in the URL. It also assumes that you have only one dot (if any) in the URL (i.e. the one in ".html"). \w is a shorthand character class which equates to [a-zA-Z0-9_].

lucy24

8:28 pm on Jan 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For this kind of thing I'd use a "two steps forward, one step back" approach. 99 requests out of 100 won't involve .html-more-stuff, so no sense in asking your server to start capturing on every request. Like this:
RewriteCond %{REQUEST_URI} ^([^.]+\.html)
RewriteRule \.html. http://www.example.com%1 [R=301,L]
Note the absence of a closing anchor: all you really need is ".html with more stuff after it" and you don't even need to specify what the more stuff is.

The two approaches are otherwise identical, except that (ahem, cough-cough) every redirect target should include the full protocol-plus-domain. If the request got one thing wrong, it may have got other things wrong too, such as the hostname, and a full target fixes them all in a single redirect.

Either way, we're assuming the URL doesn't include any periods before the one in ".html". (Literal periods in URL paths are perfectly legal-- apache dot org itself uses them-- but if you don't happen to need them, it simplifies a lot of rules.)

whitespace

9:15 pm on Jan 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for the optimisations @lucy24!

LilyTousi

9:18 pm on Jan 26, 2016 (gmt 0)

10+ Year Member



Many thanks for your help!

LilyTousi

6:05 pm on Feb 1, 2016 (gmt 0)

10+ Year Member



Sorry to bother you again!
One of my sites has been on the web for many years, and the first version had all files ending with .htm (instead of .html).
I am experiencing the same bad referrer problem (i.e. .htmAll, .htmDiscover and so on).
I tried what you suggested (RewriteRule ^([^.]+\.html)\w+$ /$1 [R=301,L]), replacing \html by \htm, but it failed!
Any clue?
Thanks

lucy24

9:55 pm on Feb 1, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the real URLs end in htm, not html, then the rule becomes (in the longer form)
RewriteCond %{REQUEST_URI} ^([^.]+\.htm)
RewriteRule \.htm. http://www.example.com%1 [R=301,L]
That's assuming you don't have mixed htm and html on the same site. (Ugh! What a mess!) If you did, the rule would have to be
RewriteCond %{REQUEST_URI} ^([^.]+\.html?)[A-Z]
RewriteRule \.html?[A-Z] http://www.example.com%1 [R=301,L]
... And if the bad URLs don't always start in a capitalized word, then I wash my hands of you :)

replacing \html by \htm

Can I assume that was a typo?

whitespace

10:57 pm on Feb 1, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



...the first version had all files ending with .htm (instead of .html)


And what version are you on now? What do the files end with now?

I am flooded with bad referrer from hotels-in.xyz


Is that the actual domain from which you are getting "incorrectly formed" traffic? Have you confirmed that there is nothing on your site that might have resulted in these malformed links? Otherwise it sounds like quite a fundamental error on their part which is likely to have resulted in a lot of corrupt outbound links?!

lucy24

3:17 am on Feb 2, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, wait, are you redirecting from .htm-plus-garbage to .htm-alone, or from .htm-plus-garbage to .html-alone? If the latter, you need to add an "l" (that is, a literal letter ell, haha) to the target. And if you've currently got mixed .htm and .html, things get even more fun. Not impossible, just more fun.
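For the second case (.htm-plus-garbage to .html-alone), a minimal sketch might look like the following. It assumes that every real page on the site now ends in .html, that paths contain no other periods, and that www.example.com stands in for the real hostname, as elsewhere in this thread. The extra condition is an addition not discussed above: without it, a clean .html request would also match the rule and redirect to itself.

```apache
RewriteEngine On

# Skip requests that already end in exactly ".html"; without this guard
# a clean .html URL would match the rule below and redirect in a loop.
RewriteCond %{REQUEST_URI} !\.html$

# Match "name.htm" with or without anything after it, and redirect to
# "name.html". $1 holds the path up to, but not including, ".htm".
RewriteRule ^([^.]+)\.htm http://www.example.com/$1.html [R=301,L]
```

Under those assumptions, /abc.htm, /abc.htmAll and /abc.htmlAll would all be sent to /abc.html, while /abc.html itself is left alone. A site that still serves real .htm files would need a different rule.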

I'm wondering the same thing as whitespace: Are these really bona fide referers that are just coming in misspelled? Or just a bunch of referer spam?

I don't think the URL error is that implausible, though. All they'd probably have to do-- at their end, not ours-- is click the wrong button at the wrong time in their CMS. Analogously, I've met stupid robots who interpreted the first thing inside an <a> element, no matter what that thing happens to be, as a relative link, so for example "<a class = 'outside'" leads to a request for "example.com/blahblah/outside". It would be funny if it weren't so annoying.

:: quick detour to logs, followed by recoil of alarm as I see that I either made a mistake or the glitch is more common than I thought ::

(Combination of both, it turns out, and also a whole lot of robotic stupidity. I'd forgotten that I have a clutch of files in one subdirectory named .html.zip) But look, here's a couple of unimpeachable examples-- there were lots of others-- where I know the mistake wasn't mine:
66.249.64.126 - - [01/Mar/2015:20:39:52 -0800] "GET /ebooks/perez/Perez.htmlOnce HTTP/1.1" 404 1412 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
....
66.249.67.42 - - [05/Apr/2015:15:34:11 -0700] "GET /hovercraft/april_blues.htmlMr HTTP/1.1" 404 1412 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I think that's the same kind of thing OP is describing.

whitespace

8:06 am on Feb 2, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Presumably these are/were also reported in Google Search Console / Webmaster Tools? From where did Googlebot find these URLs?

Where does "Once" and "Mr" come from? The "meta tag title"(?) as well?

The second IP address doesn't appear to have a reverse DNS listed? So, doesn't validate as a real Googlebot? (Maybe because the logs are a bit old?)

Maybe a bug in a version of a popular CMS/plugin?!

lucy24

9:17 pm on Feb 2, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Where does "Once" and "Mr" come from?

Search me. But there were plenty more; I just picked two at random, using Googlebot examples because those are unambiguously not referrer spam. There aren't any recent ones. Mercifully, the Googlebot seems to pull the plug pretty quickly on URLs that have never returned anything but a 404. They don't come back sporadically year after year with a fresh request the way they do with formerly-valid, now-retired URLs.

Maybe a bug in a version of a popular CMS/plugin?!

That certainly seems the most likely.

:: quick detour to WMT/GSC ::

Oh, there is one current: http://example.com/ebooks/paston/paston5.htmlTHE
Last crawled: 12/13/15
First detected: 12/13/15
-- which strongly suggests that they can tell it's a mistake, and only ever tried it once.
Linked from: vebidoo.de/blahblah
Shrug. It's one of those pseudo-directory sites, isn't it? I tend to see the name lurking around the bottom of my "links to you" lists. In fact the "link" is still listed, but now the URL is correct, so no way to check. (There are always a few spurious links to Old/Middle English content, since spelling of random words might happen to be the same as some obscure name. Same with human visitors misspelling search terms-- especially humans who are apparently too dumb to look at the snippet and figure out that this can't possibly be what they're looking for.)

LilyTousi

1:42 pm on Feb 4, 2016 (gmt 0)

10+ Year Member



Is that the actual domain from which you are getting "incorrectly formed" traffic? Have you confirmed that there is nothing on your site that might have resulted in these malformed links? Otherwise it sounds like quite a fundamental error on their part which is likely to have resulted in a lot of corrupt outbound links?


When I access Google Webmaster Tool, I can see who is the bad referrer. In this case I have these ones : hotels-in.xyz and top1hotel.com
I have many websites, and they are all « infected » by these bad referrers.
It is the first time I have seen this problem. It is probably the way they construct the link. Sometimes I have links without an extension, or with &amp;, %3E or %20, and so on.
Many thanks for your help

whitespace

3:43 pm on Feb 4, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



In this case I have these ones : hotels-in.xyz and top1hotel.com


Well, that's the thing... neither of these sites is "valid". They are not indexed at all in Google, and if you try to access them you just get an error:

Fatal error: Call to undefined function view_index() in /var/www/html/controller/index.php on line 26


(Incidentally, the very same error on both sites - alarm bells ringing?)

So, there is probably no real benefit in redirecting this traffic!? (Apart from clearing down the report in Google Search Console / formerly GWT.)

LilyTousi

1:27 pm on Feb 5, 2016 (gmt 0)

10+ Year Member



Wow, this is weird! When I first discovered these sites, they were active!
They were some sort of search engine for hotels around the world.
If they are gone for good, then it is a good thing for everyone who had the same problem.

LilyTousi

2:14 pm on Mar 21, 2016 (gmt 0)

10+ Year Member



Hi,
Sorry to bother you again.
The code you gave me to redirect (e.g. htmlAll or htmlDiscover, etc.)
RewriteRule ^([^.]+\.html)\w+$ /$1 [R=301,L]

works perfectly, BUT I recently found out that there are other bad referrers using special characters, like html.http or html&sa
is there a way to use the code, but with a larger spectrum that would include special characters ?
There should be nothing after .html
Many thanks for your support!

lucy24

4:20 pm on Mar 21, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



is there a way to use the code, but with a larger spectrum that would include special characters

Sure: just replace \w (word character) with . ("any character"):
RewriteRule ^([^.]+\.html). /$1 [R=301,L]

Leave off the closing anchor. (Was there a reason for it, buried somewhere in this thread?) The rule now says "get rid of anything after .html". It still doesn't get rid of query strings, but that would be a different issue anyway.

LilyTousi

7:49 pm on Mar 21, 2016 (gmt 0)

10+ Year Member



Thanks lucy24. It worked great! I was able to fix all the 404 errors in my Google Webmaster Tools.
I do not use query string in my sites.
Are you saying that this code would not work in this case : .html?abc
You are very helpful!

lucy24

8:24 pm on Mar 21, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you saying that this code would not work in this case : .html?abc

Yes, exactly, because a RewriteRule "sees" only the path part of the URL. If you need to consider anything else-- protocol (http/https), hostname (with/without www), port number, query string, probably some others I've overlooked-- you need a RewriteCond. But you needn't bother with this unless you start seeing legitimate requests with attached query. Illegitimate requests, like malign robots asking for index.php?long-query-aimed-at-finding-and-exploiting-loopholes, obviously don't need to be considered.
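As a hedged illustration of that RewriteCond, the sketch below handles a clean ".html" path with a query string attached, such as .html?abc. It assumes the site never uses query strings on its .html pages (as stated earlier in the thread), that paths contain no other periods, and that www.example.com is a placeholder hostname.

```apache
RewriteEngine On

# The condition is true only when the query string is non-empty
# (a single "." matches any one character).
RewriteCond %{QUERY_STRING} .

# The pattern sees only the path, so it is anchored at ".html".
# The trailing "?" on the target removes the query string from the
# redirect; without it the original query would be re-appended.
RewriteRule ^([^.]+\.html)$ http://www.example.com/$1? [R=301,L]
```

A request for /abc.html?abc would then be redirected to /abc.html, and a plain /abc.html request is untouched because the condition fails.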

Now, if you have requests with both problems-- garbage after the ".html", and then also a spurious query string-- you might choose to add a ? to the end of your target, meaning "also get rid of the query string". It won't do any harm. But, again, that's a pretty unlikely scenario.
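Put together, the "both problems" version of the rule from this thread might be sketched as follows, again with www.example.com as a placeholder and assuming no periods in the path other than the one in ".html":

```apache
RewriteEngine On

# Anything at all after ".html" triggers the redirect. The trailing "?"
# on the target also drops whatever query string came with the request.
RewriteRule ^([^.]+\.html). http://www.example.com/$1? [R=301,L]
```

On Apache 2.4 and later, the QSD flag (written as [R=301,L,QSD]) is the documented alternative to the trailing question mark.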