
Linux, Unix, and *nix like Operating Systems Forum

Instantly List All Outbound Links
more grep fun than allowed by law

 5:52 pm on Jun 30, 2009 (gmt 0)

Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command.

grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes around href values instead of double quotes, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
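As a rough sketch of that modification, the character class can be widened so a match stops at either kind of quote. The function name list_links is made up for illustration, and matching https:// and stopping at spaces or angle brackets are my own assumptions, not part of the original command.

```shell
# list_links is a hypothetical helper name, not an existing tool.
# Prints every absolute URL found in the given files, sorted, duplicates removed.
list_links() {
  # -o print only the matching text, -h suppress file names, -E extended regex.
  # A match ends at a double quote, single quote, angle bracket, or space.
  grep -ohE "https?://[^\"'<> ]+" "$@" | sort -u
}
```

Usage would look like: list_links *.html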

If your HTML files are spread across a bunch of subdirectories, have no fear: recursion is here!

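grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options so it will check all the files in subdirectories. The *.html argument becomes --include='*.html' with "." as the starting directory, because the shell expands a bare *.html in the current directory only and grep would never be handed the files further down. (--include is supported by GNU grep.)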
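Another way to reach files in subdirectories is to let find choose the files and hand them to grep, which may help where grep has no recursive option. This is only a sketch; the function name list_links_recursive is made up for illustration.

```shell
# list_links_recursive is a hypothetical helper name.
# $1 is the directory to search; defaults to the current directory.
list_links_recursive() {
  # find collects every *.html file at any depth and passes them to grep in batches.
  find "${1:-.}" -type f -name '*.html' -exec grep -oh 'http://[^"]*' {} + | sort -u
}
```

Usage would look like: list_links_recursive /path/to/site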

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
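One possible answer to the homework, as a minimal sketch: read URLs one per line and print the HTTP status curl gets for each. The function name check_links, the 10-second timeout, and following redirects are my own choices, not anything from the original post.

```shell
# check_links is a hypothetical helper name.
# Reads one URL per line on stdin and prints "<status> <url>" for each.
check_links() {
  while IFS= read -r url; do
    # Skip blank lines.
    [ -n "$url" ] || continue
    # -s silent, -o /dev/null discards the body, -w prints only the status code,
    # -L follows redirects, --max-time gives up after 10 seconds.
    # curl reports 000 when it could not get an HTTP response at all.
    status=$(curl -s -o /dev/null -L --max-time 10 -w '%{http_code}' "$url")
    printf '%s %s\n' "$status" "$url"
  done
}
```

Usage would look like: grep -oh 'http://[^"]*' *.html | sort | uniq | check_links > results.txt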



 10:30 am on Jul 1, 2009 (gmt 0)

This only works if your site is static HTML - surely most sites are in databases these days.

The KDE link checker (the one that is packaged with Quanta) has worked fine for me.


 1:19 pm on Jul 1, 2009 (gmt 0)

You would be amazed at the number of sites not in a database, or the smart sites that publish static pages from the database to avoid server overload during high demand.

Guess it won't help people with some blogs but others will find it handy.


 5:20 pm on Jul 1, 2009 (gmt 0)

OK, that makes sense, assuming that is what you are doing (rather than, for example, using memcached or a reverse proxy, unless the proxy is caching to file I suppose?).

I never really got the hang of these sorts of command lines. I find scripts easier to understand, although I guess that they are harder to adapt to changing needs.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved