Wordpress, Bad URLs, and Duplicate Content

Wondering how to hack WP such that it doesn't allow bad URLs.

blumey

1:18 am on Jun 10, 2010 (gmt 0)

I'm having an issue with WordPress where a bad URL resolves and puts content from a recognized URL in its place.

Example:

» GOOD URL
http://www.example.com/folder/this-is-my-post.html

» BAD URL
http://www.example.com/bad-information/more-bad-information/folder/this-is-my-post.html

It looks like someone scraped my site, did some sloppy find/replacing, then uploaded it at an alternate location. In this instance, WP serves the content associated with the page even though the directories leading up to that page have never existed on the site. I assume the intention was to figure out which page the user is trying to access, then serve up the content no matter what the full URL is, but the end result is me with 5 pages of duplicate content, all resulting from bunk URLs.

How do I hack WP to prevent it from allowing pages to resolve with URLs that were never intended to exist in the first place?

g1smd

7:12 am on Jun 10, 2010 (gmt 0)

If there's a simple recognisable pattern to the unwanted URLs, and that pattern can *never* match a wanted URL, you can do this without touching the Wordpress code at all.

You can simply block those URLs from ever being accessible:

RewriteRule ^pattern-matching-unwanted-path$ - [G]

However, if your site has been hacked in some way, you really do need to change passwords, clean the database, and re-upload any hacked scripts as the perpetrator will likely be back again and again to do more damage.

blumey

2:45 pm on Jun 10, 2010 (gmt 0)

» RE: "never match a wanted URL"

I can appreciate that fix; however, I'd prefer something that's a bit more proactive than reactive.

Let me be a bit clearer: the URLs look like your usual spam target keyword phrases:

(good)
http://www.example.com/legit-folder/this-is-my-post.html

(bad)
http://www.example.com/RolexWatches/Free//legit-folder/this-is-my-post.html
http://www.example.com/TrustWorthyViagra/legit-folder/this-is-my-post.html
http://www.example.com/SoftwareDiscounts/Microsoft/legit-folder/this-is-my-post.html

In all cases my post still displays, but I only want the first URL to work (I don't sell watches, Viagra, or MS products). While I can go and ban "SoftwareDiscounts", "TrustWorthyViagra", and "RolexWatches" using my htaccess, there's nothing to prevent "FreeRolexWatches", "DiscountViagra", and "TrustworthySoftware" from appearing tomorrow. I'm looking for a solution that forces WP to say "I don't have a post with that exact URL, so I'm NOT going to serve up a page; I'll return a 404 instead."

» RE: "hacked"

I am very confident that my blog has not been hacked. The URL structure follows the pattern of many scrapers who pull down a copy of the page and modify all active links in an attempt to point the value elsewhere; the person who did this just did a very, very sloppy job. At worst, a person deliberately pointed links to my site with these fictitious URLs, knowing that my blog would serve up the posts regardless of the preceding category structure, in an attempt to create duplicate content for which I would be penalized.

In both situations, the optimal solution is one where WP is modified in some way to disallow folders that do not exist as pages in the pages section. Otherwise, I have to live in constant fear of someone doing this to me, and then deal with the pages as search engines cache them (my rankings suffering all the while.)

jdMorgan

3:46 pm on Jun 10, 2010 (gmt 0)

> the optimal solution is one where WP is modified in some way

This is really the only bulletproof solution, since only WP knows whether a URL resolves to a 'valid' page or not. The solution is to take the requested URL-path, look it up in the CMS database, and if it does not *exactly* match the stored title of an existing blog entry, then return a 404-Not Found.
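
To make the exact-match idea concrete, here is a minimal Python sketch of the lookup logic. It is not WordPress code: the `KNOWN_PERMALINKS` set is a hypothetical stand-in for the permalinks stored in the CMS database, and `status_for` is an illustrative name, not a WP function.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the permalinks stored in the CMS database.
KNOWN_PERMALINKS = {
    "/legit-folder/this-is-my-post.html",
}


def status_for(url: str) -> int:
    """Return 200 only if the requested URL-path exactly matches a stored permalink."""
    path = urlsplit(url).path
    return 200 if path in KNOWN_PERMALINKS else 404


# URLs taken from the thread above.
print(status_for("http://www.example.com/legit-folder/this-is-my-post.html"))
print(status_for("http://www.example.com/TrustWorthyViagra/legit-folder/this-is-my-post.html"))
```

The point of the sketch is that the full path is compared, not just the trailing post slug, so any invented leading folders cause a 404.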

Unfortunately, unless you want to code this patch and then re-install it every time your WP is upgraded, about all you can do is reject the requests based on their taxonomy. For example, you could look for "//" or "too many directory levels" in the requested URL-path and reject those requests outright:


RewriteCond %{REQUEST_URI} // [OR]
RewriteCond $1 ^([^/]*/){2,}
RewriteRule ^(([^/]*/)+[^.]+\.html)$ - [F]
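
Before deploying a rule like this, it can help to try the patterns against known good and bad URLs. The following Python sketch only approximates the rule set above using the same regular expressions. It assumes the rules live in a per-directory (.htaccess) context, where the RewriteRule pattern sees the path without its leading slash, and it ignores any slash-merging Apache itself may perform. The function name `is_forbidden_by_depth` is made up for illustration.

```python
import re

# The same patterns as in the mod_rewrite snippet above.
RULE = re.compile(r"^(([^/]*/)+[^.]+\.html)$")  # RewriteRule pattern
DOUBLE_SLASH = re.compile(r"//")                # first RewriteCond
TOO_DEEP = re.compile(r"^([^/]*/){2,}")         # second RewriteCond, tested against $1


def is_forbidden_by_depth(request_uri: str) -> bool:
    """Approximate whether the rule set above would answer with 403."""
    # Assumption: .htaccess context, so the leading slash is stripped first.
    match = RULE.match(request_uri.lstrip("/"))
    if match is None:
        return False  # the RewriteRule pattern does not apply to this request
    # The two conditions are joined with [OR].
    return bool(DOUBLE_SLASH.search(request_uri) or TOO_DEEP.search(match.group(1)))


for uri in (
    "/legit-folder/this-is-my-post.html",
    "/RolexWatches/Free//legit-folder/this-is-my-post.html",
    "/TrustWorthyViagra/legit-folder/this-is-my-post.html",
):
    print(uri, is_forbidden_by_depth(uri))
```

Note that any legitimate .html URL nested two or more folders deep would also be rejected by this approach, which is why the depth limit has to suit the site's real URL layout.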

Or, if "legit-folder" is a single folder or an easily-enumerable limited number of folders, then


RewriteCond $1 !^(legit-folder1|legit-folder2|legit-folder3)/$
RewriteRule ^(([^/]*/)+)[^.]+\.html$ - [F]
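
The allow-list variant can be checked the same way. This Python sketch mirrors the two patterns above under the same assumption (per-directory context, leading slash stripped); the folder names are the placeholders from the snippet, and `is_forbidden_by_allowlist` is a hypothetical helper name.

```python
import re

# The same patterns as in the mod_rewrite snippet above.
RULE = re.compile(r"^(([^/]*/)+)[^.]+\.html$")  # $1 captures the folder prefix
ALLOWED_PREFIX = re.compile(r"^(legit-folder1|legit-folder2|legit-folder3)/$")


def is_forbidden_by_allowlist(request_uri: str) -> bool:
    """Approximate whether the allow-list rule above would answer with 403."""
    # Assumption: .htaccess context, so the leading slash is stripped first.
    match = RULE.match(request_uri.lstrip("/"))
    if match is None:
        return False  # no folder prefix or not an .html request: rule does not apply
    # The RewriteCond is negated: forbid when the prefix is NOT an allowed folder.
    return ALLOWED_PREFIX.search(match.group(1)) is None


for uri in (
    "/legit-folder1/this-is-my-post.html",
    "/RolexWatches/legit-folder1/this-is-my-post.html",
):
    print(uri, is_forbidden_by_allowlist(uri))
```

Because the captured prefix must be exactly one allowed folder, anything prepended to it fails the check, no matter what keywords are invented tomorrow.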

The real problem is identifying a sufficiently-selective approach that *will not* affect other valid requested URLs on your site --the ones not associated-with/handled-by WP-- but that *will* handle all or most of the bogus URL requests.

BTW, to change the 403-forbidden response to a 404-Not Found response, simply change the substitution path to point to a filepath that you know does not (and will never) exist. For example, the first RewriteRule line above becomes:

RewriteRule ^(([^/]*/)+[^.]+\.html)$ /nonexistent-file-path.html [L]

This 404-invocation method works on all versions of Apache. You could also use "RewriteRule ^(([^/]*/)+[^.]+\.html)$ - [R=404,L]" on Apache 2.2 and later.

These snippets are intended only as examples. Again, making the rules and conditions sufficiently-selective for *your site* is the challenge...

Jim