
Forum Moderators: goodroi


Weird, compounded URLs in webmaster tools

..and how can I stop this?

     
6:35 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


Hi,

I have a tonne of nonsensical pages being indexed by Google. I can see them in webmaster tools.. They're like compounds of my other URLs. Stuff like:

www.mysite.com/correct-page.php/some-other-page.php/even-more-stuff.php

I recognize all the individual pages but they're all independent and shouldn't be stacking like that. It's leading to annoying things like Google complaining about duplicate meta data, and site-search results that are duplicated and go on forever (large site.. this makes it exponentially larger).

My best guess was that apparently search engines can be confused by relative links, so I went in a few years ago and switched everything I could over to an absolute link. But there were a few things that I couldn't figure out how to switch without breaking, so I know there's still some relative links hanging around.

Is this what would be causing it?
Any other theories?
What can be done?
Someone once mentioned that weird URLs like this should give errors, but they still yield a page, which is I guess why Google keeps the results logged. How can I stop these from resolving?
7:13 pm on Mar 3, 2015 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4458
votes: 331


What do you see if you visit one of these nonsense URLs? Do you have a header checking tool such as Live HTTP Headers for Firefox? That would let you see the request/response trail when such a URL is requested. When you know the cause, you can usually find a solution.
8:11 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


I see the page named by the part of the URL up to the first ".php".
So, in the example above, I would see the full compound URL in the address bar, but be looking at the content from:

www.mysite.com/correct-page.php

..oh, but... with the comments block from "even-more-stuff.php"... I guess that's a hint in itself potentially? (they're Flash game pages with user comments below)

I don't have the header checking tools, but I'll look into that right now, thank you =)
8:28 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


Okay, I got the tool, and saved the header info.
What in particular would I be looking for? There's an overwhelming lot of stuff there haha...
8:34 pm on Mar 3, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15805
votes: 846


How can I stop these from resolving?

Details will depend on how the site is physically constructed: hand-coded php, CMS, and so on.

Do your real URLs ever really have any content (other than a query string) after the ".php" element? If no, there are a couple of different routes to take. You could change your IgnorePathInfo settings, or serve a 410, or issue a redirect.

It generally isn't worth the trouble of figuring out why G### thinks such-and-such bogus URL exists. Obviously you need to make sure there isn't anything wrong in your own code. But beyond that, just chalk it up to the search engine's fevered imagination. If you look closely in GWT, you can sometimes get them to tell you why they think a particular URL exists. But you may end up with "In sitemap", meaning a sitemap they read in 2012, which gets you no further.
9:02 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


The site is hand-coded, by my ex developer.

Some of the pages have page numbers after the .php (?page_num=1)
Does that squash options?

I've never used a site map. Maybe I should start? But that probably wouldn't resolve this particular issue.

I guess I can live without knowing *why* these pages exist haha.. My two big worries are that it really messes up the site search for the user, and I worry whether I would have SEO repercussions from the "duplicate" data, meta data, etc...
9:26 pm on Mar 3, 2015 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4458
votes: 331


What in particular would I be looking for? There's an overwhelming lot of stuff there

Each of those is a request for some part of your page. That is where you can "see" the browser requesting a URL (for a page or a file, an image, a script, etc.) and the responses from your server (200="OK", 404="Not Found", 301="Moved Permanently", etc.), so that you can see where these URLs are coming from and how the server responds. Often it can help you spot where things are going wrong, or things happening that shouldn't. It does take some patience to slog through, but it helps narrow down the source of some problems.
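As a rough aid for reading a header dump, here is a minimal Python sketch that turns a status code into its standard reason phrase and a broad category. The function name describe and the category labels are made up for illustration; only the codes and phrases come from the HTTP standard.

```python
from http import HTTPStatus


def describe(code):
    """Return a short, human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "other"
    return f"{code} {status.phrase}: {kind}"


# Codes that come up in this thread.
for code in (200, 301, 302, 404, 410):
    print(describe(code))
```

When scanning the log, the lines worth a second look are the ones where a URL you know is bogus still comes back in the "success" group.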
9:48 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


There isn't a single 404 in the thing.. One 301 but for an external widget thing that I'm sure isn't related. Virtually everything comes back as 200. It requests these bogus things from bogus locations and then somehow fetches them successfully O.o

Eg:
http:// www.example.com/maker.php/games/games/images/spacer.gif

GET /maker.php/games/games/images/spacer.gif HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.example.com/maker.php/games/games/avatar-maker.php
Cookie: __cfduid=cd1; __utma=484; _ga=GA1.2.2; __qca=P0-20; ads_bm_last_load_status=BLOCKING; example_cookie_a=J41; example_cookie_b=ksd; example_cookie_c=rs8; PHPSESSID=780e; _gat=1; bm_monthly_unique=true; bm_daily_unique=true; bm_sample_frequency=1; ads_bm_daily_shown_ad=true; ads_bm_monthly_shown_ad=true; MarketGidStorage=%7B%220; bm_last_load_status=BLOCKING
Connection: keep-alive

HTTP/1.1 200 OK




(actual location of said image is:
http:// www.example.com/images/spacer.gif)
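
The request above is consistent with a relative link being resolved against an already-compounded page URL. The following Python sketch reproduces that resolution with the standard library. The page URL is the one from the Referer header; the relative references are assumptions about what the page's HTML might contain, not something confirmed in the thread.

```python
from urllib.parse import urljoin

# Page URL as the browser sees it (from the Referer header in the log).
page_url = "http://www.example.com/maker.php/games/games/avatar-maker.php"

# Hypothetical relative references as they might appear in the HTML.
relative_image = "images/spacer.gif"
relative_link = "games/avatar-maker.php"

# Hypothetical root-relative version of the same image reference.
root_relative_image = "/images/spacer.gif"

# A browser resolves relative references against the current page URL,
# so the extra path segments after maker.php are carried along.
compound_image = urljoin(page_url, relative_image)
compound_link = urljoin(page_url, relative_link)

# A reference starting with "/" ignores the current path entirely.
fixed_image = urljoin(page_url, root_relative_image)

print(compound_image)
print(compound_link)
print(fixed_image)
```

Under these assumptions, each relative link followed from a compounded URL adds another layer, which would explain how a crawler keeps discovering longer and longer variants, while root-relative or absolute references always resolve to the real location.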

[edited by: goodroi at 11:25 pm (utc) on Mar 3, 2015]
[edit reason] Examplified [/edit]

10:52 pm on Mar 3, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


Hmm... interesting...

I tried doing the same compound thing on a friend's site, running the header tool, and it was acting totally the same way as mine (resolving when it shouldn't, but wonky).

Then I tried it on *another* friend's site and it behaved as one would like: went to a 404 page. The only difference I can see in the header tool is instead of a 200, it yielded:

HTTP/1.1 302 Moved Temporarily
...and got redirected to the 404 error page on the site.

Where would be the settings to control how it handles these pages? .htaccess?
I'm so jealous.. I want mine yielding this 302... >_<
11:05 pm on Mar 3, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15805
votes: 846


Some of the pages have page numbers after the .php (?page_num=1)
Does that squash options?

No, that doesn't matter at all. Anything after a ? is a query string; it doesn't affect the path.

PathInfo approach (my bad, I said "IgnorePathInfo" earlier but the directive is actually "AcceptPathInfo"):
AcceptPathInfo Off
means that any request with stuff after the extension will lead to a 404 page. This is probably not the route you want to take.

Look-at-the-URL approach:
Assuming for the sake of discussion that your URLs do not contain literal periods (it is perfectly legal, but rules are simpler when they don't, so let's take the easy way first)

RewriteRule ^([^.]+\.php). http://www.example.com/$1 [R=301,L]

OR

RewriteRule \.php. - [G]

Those are two different responses and it's entirely up to you. First version: capture everything up through the extension. If there's additional stuff afterward, redirect to the form without the additional stuff. Second version: if there's stuff after the extension, send a 410 "It ain't here no more" response. You can also manually return a 404 using the locution [R=404,L] but if you use 410 the googlebot will go away faster.

Again: "stuff after the extension" applies only to the path. Query strings don't matter (and, if present, will be reappended to the redirect target).
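
To check which requests the two patterns would catch before putting them on a live site, they can be tried out in Python. This is only a sketch: Python's re module is not the regex engine Apache uses, although these two patterns use syntax common to both, and the function names are made up for illustration. The paths are written without a leading slash and without a query string, on the assumption that the rules live in a root .htaccess file, where the pattern is matched against the path alone.

```python
import re

# The two patterns from the RewriteRule lines above.
REDIRECT_PATTERN = re.compile(r"^([^.]+\.php).")
GONE_PATTERN = re.compile(r"\.php.")


def redirect_target(path):
    """Return the clean path the first rule would redirect to, or None."""
    match = REDIRECT_PATTERN.search(path)
    return match.group(1) if match else None


def is_gone(path):
    """Return True if the second rule would answer with 410 Gone."""
    return GONE_PATTERN.search(path) is not None


samples = [
    "correct-page.php",
    "correct-page.php/some-other-page.php/even-more-stuff.php",
    "maker.php/games/games/images/spacer.gif",
    "games/avatar-maker.php",
]

for path in samples:
    print(path, redirect_target(path), is_gone(path))
```

A normal page such as correct-page.php, with or without ?page_num=1 in the real request, is left alone by both patterns because nothing follows the extension in the path.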

Edit because we overlapped:
got redirected to the 404 error page on the site.

Well, that just means your friend's site is badly coded :) Crystal ball says their ErrorDocument line-- assuming apache-- contains the full domain name instead of beginning in / for root.

I want mine yielding this 302.

No, you don't. You may choose to return a 301 (earlier in this post) but a 302 should be reserved for redirects that really are temporary, as when a brick-and-mortar business sends you to a different entrance while they're remodeling.
11:34 pm on Mar 3, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


But there were a few things that I couldn't figure out how to switch without breaking

This is probably why you still have it.

Why couldn't you switch these? What would have broken?
3:28 am on Mar 4, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


Why couldn't you switch these? What would have broken?

The site is quite complex so once I got to the parts controlling user logins, user gallery controls and stuff like that things started to break. I probably didn't know just how many folders up to go etc. There's a lot of folders and subfolders involved in organizing that stuff so I think that portion was beyond me.

But this being said.. I've taken another look at the "duplicate title tag" list in GWT, and I've noticed that none of the *newer* games are appearing there. Maybe the changes I made before really did stop new errors from piling up? These old ones have been there forever though so maybe I assumed nothing had changed. Maybe the issue is resolved? (or at least not getting worse haha)...
3:30 am on Mar 4, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


Those are two different responses and it's entirely up to you. First version: capture everything up through the extension. If there's additional stuff afterward, redirect to the form without the additional stuff. Second version: if there's stuff after the extension, send a 410 "It ain't here no more" response. You can also manually return a 404 using the locution [R=404,L] but if you use 410 the googlebot will go away faster.

Wow, thank you! I will see if I can make this work =)
Anything to slay the googlebot!

No, you don't. You may choose to return a 301 (earlier in this post) but a 302 should be reserved for redirects that really are temporary, as when a brick-and-mortar business sends you to a different entrance while they're remodeling.

Hahaha, oh okay! Good to know =)
Thank you!
3:49 am on Mar 4, 2015 (gmt 0)

New User

joined:Mar 3, 2015
posts: 25
votes: 2


RewriteRule ^([^.]+\.php). http://www.example.com/$1 [R=301,L]

OR

RewriteRule \.php. - [G]


Wow, these both work like a charm, thanks!
I've been debating which I should choose.. Since, as far as I know, this has been more of a bot problem than a user problem, I'm gonna go with the second one and really try to clean up this mess.

"GONE
The requested resource is no longer available on this server and there is no forwarding address"
...I've never seen something more beautiful, lol!
 
