Instantly List All Outbound Links


incrediBILL - 5:52 pm on Jun 30, 2009 (gmt 0)


Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute http:// URLs, sorted and with duplicates removed, in a single command.

grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes instead of double quotes around href values, and as written it will not match https:// links, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
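One way the regex could be loosened to cope with either quote style, and with https:// links as well, is sketched below. The function name list_links is made up for this example, and the pattern is only an assumption about what your markup looks like, so adjust it to taste.

```shell
# list_links is a hypothetical helper name used only for this sketch.
# Usage: list_links *.html
list_links() {
    # -E extended regex, -o print only the matched text, -h omit file names.
    # The match stops at a double quote, single quote, whitespace, or angle bracket.
    grep -Eoh "https?://[^\"'<>[:space:]]+" "$@" | sort -u
}
```

Here sort -u does the same job as sort piped into uniq.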

If you have your HTML files in a bunch of subdirectories, not to fear: recursion is here!

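grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options and swapped the bare *.html argument for --include='*.html' plus a starting directory of "." so it will check all the HTML files in subdirectories. With a bare *.html the shell only expands file names in the current directory, so grep would never descend any further. The --include option is supported by GNU grep.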

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
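For anyone who wants a head start on the homework, here is a minimal sketch of one way to do it. The function name check_links and the ten-second timeout are my own assumptions, not anything from a standard tool. It reads one URL per line from standard input and prints the HTTP status code next to each URL.

```shell
# check_links is a hypothetical helper name used only for this sketch.
# Reads URLs from standard input, one per line, and prints "<status> <url>".
check_links() {
    while IFS= read -r url; do
        # -s silences progress output, -o /dev/null discards the response body,
        # -w '%{http_code}' prints only the status code, and --max-time 10 gives
        # up after ten seconds. curl prints 000 when it gets no HTTP response.
        # Redirecting stdin from /dev/null keeps curl away from the URL list.
        status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url" < /dev/null)
        printf '%s %s\n' "$status" "$url"
    done
}
```

Pipe the sorted URL list from grep into check_links and redirect the output to a file to keep a record. Redirects are not followed in this sketch, so a moved page shows up as 301 or 302 rather than the status of its final destination.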


Thread source: http://www.webmasterworld.com/linux/3943390.htm