Forum Moderators: martinibuster

Anyone know of crawler software that extracts web addresses?

     
1:33 pm on Mar 3, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 25, 2003
posts:2527
votes: 0


I'm looking for something that will crawl a given website and return all the outbound URLs.

Either anything that is within an <a href> tag, or any text that starts with www.

Thanks
6:00 pm on Mar 3, 2014 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts:399
votes: 11


Check this thread: [webmasterworld.com...]

User @affiliation has suggested a great free tool that does the job. You can set it to ignore the internal URLs - thus it will only return the outbound ones.
8:46 pm on Mar 3, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 25, 2003
posts:2527
votes: 0


I tried Xenu. No matter what I do (even with depth set to 999), it only scans one page.

I figured it was broken?
9:16 pm on Mar 3, 2014 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts:399
votes: 11


That's very unusual. Maybe the links are hidden via JavaScript? Although I doubt it. You can drop me some screenshots via Sticky and I'll have a look. I've been using it for longer than I care to remember and it's always worked great.
11:20 pm on Mar 3, 2014 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 13


is xenu finding any links on that page?

"Either anything that is within an <a href> tag, or any text that starts with www."

xenu won't show you the (unanchored) url citations on a page so you would need a separate tool to find those.

the www. pattern won't be sufficient to capture all uris on the page, as even sites that are canonicalized to the www.example.com hostname are often referred to without the www.
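
For anyone wanting to script this rather than rely on a link checker, here is a rough Python sketch of the two cases discussed above: links inside anchor tags, and unanchored URL citations in plain text. It uses only the standard library. The names `LinkCollector`, `outbound_urls` and `BARE_URL`, and the short TLD list in the pattern, are assumptions made up for illustration, not part of any tool mentioned in this thread. It handles a single page of HTML you have already fetched, and the text pattern will produce false positives (for example the domain part of an email address).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed pattern: http(s):// URLs, www.-prefixed hosts, and bare hostnames
# ending in a few common TLDs (the TLD list is illustrative, not complete).
BARE_URL = re.compile(
    r"\b(?:https?://|www\.)[^\s<>\"']+"
    r"|\b(?:[a-z0-9-]+\.)+(?:com|net|org|co\.uk)\b(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)


def strip_www(host):
    """Treat www.example.com and example.com as the same site."""
    return host[4:] if host.startswith("www.") else host


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and URL-like strings in other text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchored = set()
        self.unanchored = set()
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.anchored.add(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        if self._anchor_depth:
            return  # anchor text is already covered by the href
        for match in BARE_URL.findall(data):
            url = match.rstrip(".,;:!?)")
            if not url.lower().startswith(("http://", "https://")):
                url = "http://" + url
            self.unanchored.add(url)


def outbound_urls(html, base_url):
    """Return sorted URLs whose host differs from base_url's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    site = strip_www(urlparse(base_url).hostname or "")

    def is_outbound(url):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return False
        host = strip_www(parsed.hostname or "")
        return host != "" and host != site

    return sorted(u for u in collector.anchored | collector.unanchored if is_outbound(u))
```

To cover a whole site you would still need to fetch each page and feed the internal links back into a queue, and links written by JavaScript would not be seen at all, since only the raw HTML is parsed.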