|URLs in a loop with non-existent sub-directories|
I have no idea if this is being posted in the right part of the forum, because I've no idea what is causing it. But I'm assuming that it is a technology issue. Anyway, here goes...
We have moved a static website (plain html files) from one IIS server to another. So far, everything seems fine... apart from the following.
As part of the testing process, automated link checkers appear to be getting caught in a loop over a whole series of URLs. All of these take this form:
And so on...
Then also within certain directories, e.g:
And so on and so on...
The filename.html (and other pages) are real pages, but obviously they only appear in one place - not in a sub-directory called '1'. In fact there isn't a subdirectory called '1'.
We have checked the whole site and cannot locate a root for this issue - rogue link or otherwise. There is certainly no link pointing towards a directory with /1/ in the URL.
Any ideas, suggestions as to where I might look would be most welcome. Thanks.
Also, the only other non-existent file which crops up in these tests is www.example.com/index.index.html
Which is also odd. www.example.com/index.html exists, so where that extra 'index.' is coming from is also a mystery.
There's only really a small number of places that malformed URLs can come from.
1. For a certain malformed request X, the pages of your site link to malformed URLs due to an error in the scripts that run your site. Xenu LinkSleuth can prove or disprove this very quickly.
2. An error in the server configuration means that for certain requests the server returns a redirect to a malformed URL. Xenu is of use here too.
3. You're using internal links in the site navigation in the form href="..\..\..\somepage.html" and some bot is misinterpreting them in some way.
4. There's some other site linking to malformed URLs due to either an error in the scripting of the other site or else a simple cut and paste error by someone posting links there.
5. A bot has a bug such that it requests non-valid URLs from your site, even when presented with valid links.
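To illustrate how relative links (point 3) can snowball into a loop, here is a minimal Python sketch. It assumes, purely for illustration, that a relative link sits on a page which the server returns for whatever URL is requested; the href and starting URL below are made up and are not taken from the site in question. Each time a crawler follows the link, it is resolved against the previous URL, so the path keeps growing.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url, href, hops):
    """Resolve the same relative href repeatedly, as a crawler would if
    every requested URL returned a page containing that link."""
    urls = []
    base = start_url
    for _ in range(hops):
        base = urljoin(base, href)  # standard relative-URL resolution
        urls.append(base)
    return urls


# Hypothetical values, for illustration only.
for url in follow_relative_link("http://www.example.com/index.html",
                                "1/filename.html", 3):
    print(url)
```

A root-relative link (one starting with "/") resolves to the same URL from any page, so it cannot accumulate in this way.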
#6 A mistyped RegEx that says "1" when you meant to say "\1" or "$1" or "%1".
Been there. Done that.
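For anyone unsure what #6 means, a small sketch in Python may help. The pattern and paths are hypothetical, and Python writes a backreference as "\1", whereas server rewrite rules may use "$1" or "%1" depending on the software. The point is only that dropping the escape character turns "whatever was captured" into a literal "1".

```python
import re

# Hypothetical rewrite: move everything under /old/ to /new/.
pattern = r"^/old/(.*)$"
path = "/old/filename.html"

# Mistyped replacement: "1" is just the character 1.
wrong = re.sub(pattern, "/new/1", path)

# Intended replacement: "\1" inserts the text captured by (.*).
right = re.sub(pattern, r"/new/\1", path)

print(wrong)
print(right)
```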
Thanks for those.
OK, let me think...
1. I'll try Xenu. I think unlikely, it being a static site. There is an internal site search engine, but that also ran on the old site, without an issue.
2. Will try Xenu...
3. The site is not open to the outside world. So far 'Site Sucker', a Mac archiving tool (a scraper, if you like) that we used, has followed these links, and the internal search engine has also indexed them.
4. As above, site is not currently public.
5. Ditto, I think.
6. Hmmmm. Not really sure what that means - I'm not really a coder.
Wherever it is coming from, it seems that every automated tool we use (our internal site search and our archiving tool) is following these links. Very odd.
Thanks again for the help. I'll go and do some checking. This is hurting my head.
|6. Hmmmm. Not really sure what that means |
Cross it off the list, then, unless someone else has made you a custom htaccess. I'm thinking of redirect bloopers-- but those are more likely to end up in infinite loops, with google reporting "could not reach page" errors.
But "index.index.html" does seem to be the same kind of pattern. Does it happen only with the top-level index, or with other directory indexes as well? Does every single one of your internal links say correctly / alone, or has a stray "/index.html" sneaked in? A link checker will tell you, because you'll see it going to both pages-- or the same page twice, depending on how you look at it.
:: slinking off for long-overdue housekeeping of my own in /rats/ and /games/ directories to get rid of, ahem, /index.html links ::
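If a link checker is not to hand, the hunt for stray "/index.html" links can be roughed out in a few lines of Python. This is only a sketch: the helper names are made up, it looks at a single page's HTML passed in as a string, and it only inspects the href of anchor tags.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def explicit_index_links(html):
    """Return hrefs that name index.html instead of ending in a bare slash."""
    collector = HrefCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        # Ignore any query string or fragment before checking the path.
        path = href.split("#")[0].split("?")[0]
        if path.endswith("index.html"):
            found.append(href)
    return found


sample = ('<a href="/">Home</a> <a href="/games/index.html">Games</a> '
          '<a href="rats/">Rats</a>')
print(explicit_index_links(sample))
```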
I've run Link Sleuth and it says status 'OK' - but it has redirected to our custom 404 error page.
[edited by: bouncybunny at 6:37 am (utc) on Dec 21, 2011]
Thanks lucy24. I'm pretty certain there's only one index file.
<-- EDIT -->
OK I've cracked it.
There was a link in the custom error page, and in one other page, which was formatted as follows
Quite how I missed that is anyone's guess.
Thanks for the help, both of you. Much appreciated.
Seemingly complex issues often turn out to be a simple typo somewhere in the end.
Glad you fixed it.
Make sure that the duff URLs now return 404 Not Found or 410 Gone.
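One rough way to check this without a link checker is sketched below in Python, using only the standard library. The function names are made up for illustration. The first function fetches a URL and reports the final status code and final URL after any redirects; the second decides whether that looks like a proper "missing" response or a page that claims to be fine (for example a redirect to an error page that answers 200 OK).

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return (status_code, final_url) after following any redirects."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions by urllib.
        return err.code, err.geturl()


def verdict(status, requested_url, final_url):
    """Judge the response for a URL that is known not to exist."""
    if status in (404, 410):
        return "ok: reported as missing"
    if final_url != requested_url:
        return "suspect: redirected to %s with status %d" % (final_url, status)
    return "suspect: status %d for a URL that should not exist" % status
```

For a URL that should not exist, such as http://www.example.com/index.index.html from earlier in this thread, the result of fetch_status(url) can be passed to verdict along with the requested URL. Anything other than "ok" deserves a closer look.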
Minor typo in my post above: ..\..\..\ should have read ../../../
Thanks for your help g1smd.
Yep, they all return 404 now. Or at least according to the link checkers. I'll check them with external sources once the site goes live.
So, in retrospect, I solved two problems in one. Had this not turned up, I might not have noticed the 404 issue until much later.