
Website Technology Issues Forum

    
URLs in a loop with non-existent sub-directories
bouncybunny




msg:4399895
 10:41 pm on Dec 20, 2011 (gmt 0)

I have no idea if this is being posted in the right part of the forum, because I've no idea what is causing it. But I'm assuming that it is a technology issue. Anyway, here goes...

We have moved a static website (plain html files) from one IIS server, to another. So far, everything seems fine... apart from the following.

As part of the testing process, automated link checkers appear to be getting caught in a loop over a whole series of URLs. All of these take the following form:

www.example.com/1/filename.html
www.example.com/1/1/filename.html
www.example.com/1/1/1/filename.html
www.example.com/1/1/1/1/filename.html

And so on...

Then also within certain directories, e.g:

www.example.com/subdirectory/1/filename.html
www.example.com/subdirectory/1/1/filename.html
www.example.com/subdirectory/1/1/1/filename.html
www.example.com/subdirectory/1/1/1/1/filename.html

And so on and so on...

The filename.html (and other pages) are real pages, but obviously they only appear in one place - not in a sub-directory called '1'. In fact there isn't a subdirectory called '1'.

We have checked the whole site and cannot locate a root for this issue - rogue link or otherwise. There is certainly no link pointing towards a directory with /1/ in the URL.

Any ideas, suggestions as to where I might look would be most welcome. Thanks.

 

bouncybunny




msg:4399907
 11:18 pm on Dec 20, 2011 (gmt 0)

Also, the only other non-existent file which crops up in these tests is www.example.com/index.index.html

Which is also odd. www.example.com/index.html exists, so where that extra 'index.' is coming from is also a mystery.

g1smd




msg:4399921
 12:10 am on Dec 21, 2011 (gmt 0)

There's only really a small number of places that malformed URLs can come from.

1. For certain malformed request X, the pages of your site link to malformed URLs due to an error in the scripts that run your site. Xenu LinkSleuth can prove or disprove this very quickly.

2. An error in the server configuration means that for certain requests the server returns a redirect to a malformed URL. Xenu is of use here too.

3. You're using internal links in the site navigation in the form href="..\..\..\somepage.html" and some bot is misinterpreting them in some way.

4. There's some other site linking to malformed URLs due to either an error in the scripting of the other site or else a simple cut and paste error by someone posting links there.

5. A bot has a bug such that it requests non-valid URLs from your site, even when presented with valid links.

lucy24




msg:4399926
 12:19 am on Dec 21, 2011 (gmt 0)

#6 A mistyped RegEx that says "1" when you meant to say "\1" or "$1" or "%1".

Been there. Done that.
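
To illustrate the kind of slip described in #6, here is a small Python sketch. It is only an analogy: the thread concerns an IIS server, and the rewrite rule, the /old/ and /new/ paths, and the function names are all hypothetical. The point is just that a bare 1 in a replacement string is the literal character "1", while \1 refers back to the captured group.

```python
import re

# Hypothetical rewrite rule: map /old/<anything> to /new/<anything>.
PATTERN = re.compile(r"^/old/(.*)$")


def rewrite_correct(path: str) -> str:
    # \1 is a backreference to the first captured group.
    return PATTERN.sub(r"/new/\1", path)


def rewrite_mistyped(path: str) -> str:
    # A bare 1 is just the literal character "1"; the capture is discarded.
    return PATTERN.sub("/new/1", path)


if __name__ == "__main__":
    print(rewrite_correct("/old/filename.html"))   # /new/filename.html
    print(rewrite_mistyped("/old/filename.html"))  # /new/1
```

With the mistyped version, every matching request ends up at a URL containing a stray "1", which is the sort of pattern worth ruling out when unexplained /1/ paths appear.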

bouncybunny




msg:4399949
 5:04 am on Dec 21, 2011 (gmt 0)

Thanks for those.

OK, let me think...

1. I'll try Xenu. I think unlikely, it being a static site. There is an internal site search engine, but that also ran on the old site, without an issue.

2. Will try Xenu...

3. The site is not open to the outside world. So far 'Site Sucker', a Mac archiving tool (a scraper, if you like) that we used, has followed these links, and the internal search engine has also indexed them.

4. As above, site is not currently public.

5. Ditto, I think.

6. Hmmmm. Not really sure what that means - I'm not really a coder.

Wherever it is coming from, it seems that every automated tool we have tried (our internal site search and our archiving tool) is following these links. Very odd.

Thanks again for the help. I'll go and do some checking. This is hurting my head.

lucy24




msg:4399961
 5:39 am on Dec 21, 2011 (gmt 0)

6. Hmmmm. Not really sure what that means

Cross it off the list, then, unless someone else has made you a custom htaccess. I'm thinking of redirect bloopers-- but those are more likely to end up in infinite loops, with google reporting "could not reach page" errors.

But "index.index.html" does seem to be the same kind of pattern. Does it happen only with the top-level index, or with other directory indexes as well? Does every single one of your internal links say correctly / alone, or has a stray "/index.html" sneaked in? A link checker will tell you, because you'll see it going to both pages-- or the same page twice, depending on how you look at it.


:: slinking off for long-overdue housekeeping of my own in /rats/ and /games/ directories to get rid of, ahem, /index.html links ::
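
For anyone who would rather script the check for stray /index.html links than eyeball a link checker report, here is a minimal Python sketch using only the standard library. The names HrefCollector and index_links are made up for this example, and it only inspects one HTML string at a time; walking the site's files is left out.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def index_links(html: str) -> list:
    """Return hrefs that name index.html explicitly rather than the bare directory."""
    parser = HrefCollector()
    parser.feed(html)
    flagged = []
    for href in parser.hrefs:
        # Ignore any query string or fragment before testing the path.
        path = href.split("?")[0].split("#")[0]
        if path.endswith("index.html"):
            flagged.append(href)
    return flagged
```

Because the test is a simple endswith, an href such as index.index.html would be flagged as well, which is handy given the odd URL reported earlier in the thread.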

bouncybunny




msg:4399966
 6:01 am on Dec 21, 2011 (gmt 0)

Gah.

I've run Link Sleuth and it says status 'OK' - but has redirected to our custom 404 Error page.

[edited by: bouncybunny at 6:37 am (utc) on Dec 21, 2011]

bouncybunny




msg:4399967
 6:10 am on Dec 21, 2011 (gmt 0)

Thanks lucy24. I'm pretty certain there's only one index file.

<-- EDIT -->

OK I've cracked it.

There was a link in the custom error page and one other page, which were formatted as follows
<a href="1/filename.html">link</a>.

Quite how I missed that is anyone's guess.

Thanks for the help, both of you. Much appreciated.
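
For later readers, here is a rough Python sketch of why a relative link like that can send a crawler round in circles. It assumes (the thread does not confirm this) that the custom error page is served in place at whatever URL was requested, so its relative href is resolved against that URL each time. The function name and the starting URL are made up for the example.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url: str, href: str, hops: int) -> list:
    """Resolve the same relative href repeatedly, as a crawler would
    if every page it fetched contained that link."""
    urls = []
    current = start_url
    for _ in range(hops):
        # A relative href is resolved against the URL of the page it sits on.
        current = urljoin(current, href)
        urls.append(current)
    return urls


if __name__ == "__main__":
    for url in follow_relative_link(
        "http://www.example.com/missing.html", "1/filename.html", 4
    ):
        print(url)
```

Each hop adds another /1/ segment, matching the URLs in the first post. A root-relative href such as /filename.html resolves to the same URL on every hop, so it cannot grow like this.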

g1smd




msg:4399983
 7:44 am on Dec 21, 2011 (gmt 0)

Seemingly complex issues often turn out to be a simple typo somewhere in the end.

Glad you fixed it.

Make sure that the duff URLs now return 404 Not Found or 410 Gone.
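
If you want to check that by script rather than by link checker, something along these lines should do as a starting point. It is a sketch using Python's standard library only; raw_status and is_properly_missing are names invented here. It deliberately does not follow redirects, so a bogus URL that redirects to the error page shows up as a 3xx rather than as whatever the error page itself returns.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the first response is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def raw_status(url: str) -> int:
    """Return the status code of the first response for url, without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses arrive here; the code is what we want.
        return err.code


def is_properly_missing(status: int) -> bool:
    """True if a nonexistent URL is reported as 404 Not Found or 410 Gone."""
    return status in (404, 410)
```

Calling is_properly_missing(raw_status(url)) on one of the duff URLs should give True once the server is set up correctly; a 200 or a 302 there means the error page is still masking the missing file.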

Minor typo in my post above: ..\..\..\ should have read ../../../

bouncybunny




msg:4400382
 4:28 am on Dec 22, 2011 (gmt 0)

Thanks for your help g1smd.

Yep, they all return 404 now. Or at least according to the link checkers. I'll check them with external sources once the site goes live.

So, in retrospect, I solved two problems in one. Had this not turned up, I might not have noticed the 404 issue until much later.
