incrediBILL - 5:52 pm on Jun 30, 2009 (gmt 0)
    grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes in HREFs instead of double quotes, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.

If you have your HTML files in a bunch of subdirectories, not to fear, recursion is here!

    grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options so it will check all the files in subdirectories. I also replaced the *.html argument with --include='*.html' and a starting directory of ".", because the shell expands a bare *.html only against the current directory, so files in subdirectories would never be searched. This assumes a grep that supports --include, such as GNU grep.

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command.
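For the curl homework mentioned above, here is one rough way it could be done. It is only a sketch, not the poster's own solution. The function names extract_links and check_links are made up for illustration, the regex is widened to stop at a single quote, ">" or space as well as a double quote, and it assumes GNU grep plus a curl new enough to support -w '%{http_code}'. Some servers refuse HEAD requests, so a non-200 status deserves a manual look before you call a link dead.

```shell
# extract_links DIR
# Print every unique absolute http:// URL found in .html files under DIR.
# Hypothetical helper; the character class stops at ", ', > or a space.
extract_links() {
    grep -ohr --include='*.html' "http://[^\"'> ]*" "$1" | sort -u
}

# check_links
# Read URLs from stdin, one per line, and print "<status> <url>" for each.
# -I sends a HEAD request, -L follows redirects, --max-time caps each try.
# curl writes 000 as the status when the request fails outright.
check_links() {
    while IFS= read -r url; do
        status=$(curl -s -o /dev/null -I -L --max-time 10 \
                      -w '%{http_code}' "$url")
        printf '%s %s\n' "$status" "$url"
    done
}
```

To run it against the current directory and keep the results: extract_links . | check_links > link-report.txt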