
Linux, Unix, and *nix like Operating Systems Forum

    
Instantly List All Outbound Links
more grep fun than allowed by law
incrediBILL




msg:3943392
 5:52 pm on Jun 30, 2009 (gmt 0)

Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command.

grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes around href values instead of double quotes, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
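One way that modification might look, as a sketch rather than the original command: the character class below stops at either quote style, a space, or a closing angle bracket. Matching https:// as well is an added assumption, and the function name list_links is made up for illustration.

```shell
# list_links: hypothetical helper name, not from the original post.
# Prints each absolute http:// or https:// URL found in the given files,
# cutting the match at a double quote, single quote, space, or '>'.
list_links() {
    # -o print only the match, -h omit file names, -E allow "s?" in the pattern
    grep -ohE "https?://[^\"' >]*" "$@" | sort -u
}

# Example call (commented out so that sourcing this file has no side effects):
# list_links *.html
```

Here sort -u does the same job as sort followed by uniq.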

If you have your HTML files in a bunch of subdirectories, not to fear: recursion is here!

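grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options so it will check all the files in subdirectories. The --include option (GNU grep) and the "." starting directory are needed because the shell expands a bare *.html against the current directory only.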
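Where grep lacks recursive options, the same walk can be sketched with find, which visits every subdirectory and hands the matching files to grep. The function name list_links_recursive and its optional directory argument are assumptions for illustration only.

```shell
# list_links_recursive: hypothetical helper name, not from the original post.
# Walks the given directory (default: current directory), runs the same
# grep pattern on every *.html file, and prints the unique URLs sorted.
list_links_recursive() {
    find "${1:-.}" -type f -name '*.html' \
        -exec grep -oh 'http://[^"]*' {} + | sort -u
}

# Example call (commented out so that sourcing this file has no side effects):
# list_links_recursive /path/to/site
```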

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
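A possible starting point for that homework, offered as a sketch under a few assumptions: URLs arrive one per line on standard input, a HEAD request is enough to judge a link (some servers reject HEAD, so a GET may be needed), and ten seconds per request is an acceptable timeout. The name check_links is invented here.

```shell
# check_links: hypothetical helper name, not from the original post.
# Reads one URL per line on stdin and prints "<status> <url>" for each.
# A status of 000 means curl received no HTTP response at all.
check_links() {
    while IFS= read -r url; do
        # -s silent, -o /dev/null discard output, -I send a HEAD request,
        # -L follow redirects, --max-time cap each request at 10 seconds,
        # -w print only the final HTTP status code.
        status=$(curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$url")
        printf '%s %s\n' "$status" "$url"
    done
}

# Example pipeline (commented out because it touches the network):
# grep -oh 'http://[^"]*' *.html | sort | uniq | check_links > results.txt
```

Filtering results.txt for anything that does not start with 200 gives a quick list of links worth a second look.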

 

graeme_p




msg:3943964
 10:30 am on Jul 1, 2009 (gmt 0)

This only works if your site is static HTML - surely most sites are in databases these days.

The KDE link checker (the one that is packaged with Quanta) has worked fine for me.

incrediBILL




msg:3944074
 1:19 pm on Jul 1, 2009 (gmt 0)

You would be amazed at the number of sites not in a database, or the smart sites that publish static pages from the database to avoid server overload during high demand.

Guess it won't help people with some blogs but others will find it handy.

graeme_p




msg:3944199
 5:20 pm on Jul 1, 2009 (gmt 0)

OK, that makes sense, assuming that is what you are doing rather than, for example, using memcached or a reverse proxy (unless the proxy is caching to file, I suppose?).

I never really got the hang of these sorts of command lines. I find scripts easier to understand, although I guess that they are harder to adapt to changing needs.
