Msg#: 3943390 posted 5:52 pm on Jun 30, 2009 (gmt 0)
Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command:
grep -oh 'http://[^"]*' *.html | sort | uniq
The regex may need modification if your HTML uses single quotes around href values instead of double quotes, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
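As a rough sketch of that modification, the variant below stops a URL at either kind of quote. The function name extract_urls is just something made up for illustration, and matching https:// as well as http:// is my own assumption about what you'd want, not part of the original command.

```shell
# extract_urls (hypothetical helper name): print every http:// or https:// URL
# found in the given files, sorted with duplicates removed.
# A URL is taken to end at the first double OR single quote.
extract_urls() {
    # -o print only the matching part, -h omit file names, -E extended regex
    grep -ohE "https?://[^\"']*" "$@" | sort -u
}
```

You would call it as: extract_urls *.html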
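If you have your HTML files in a bunch of subdirectories, not to fear, recursion is here!
grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq
Note that I added an "r" to the grep options so it will check all the files in subdirectories. I also swapped the *.html argument for --include='*.html' (supported by GNU grep) with "." as the starting directory, because a bare *.html is expanded by the shell and only matches names in the current directory.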
For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
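If you want a head start on the homework, here is a minimal sketch rather than a finished checker. The function name check_links and the 10 second timeout are my own choices. It sends a HEAD request to each URL, and some servers answer HEAD differently than GET, so treat odd results with suspicion.

```shell
# check_links (hypothetical helper name): read one URL per line on stdin and
# print "<status> <url>" for each. A status of 000 means curl got no HTTP
# response at all (DNS failure, timeout, refused connection, ...).
check_links() {
    while IFS= read -r url; do
        # skip blank lines
        [ -n "$url" ] || continue
        # -s silent, -o /dev/null discard output, -I send a HEAD request,
        # -L follow redirects, --max-time 10 give up after ten seconds,
        # -w '%{http_code}' print the final HTTP status code
        status=$(curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$url")
        printf '%s %s\n' "$status" "$url"
    done
}
```

Feed it the list from the command above, for example: grep -oh 'http://[^"]*' *.html | sort | uniq | check_links > link-report.txt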