
Instantly List All Outbound Links

more grep fun than allowed by law

     

incrediBILL

5:52 pm on Jun 30, 2009 (gmt 0)

Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command.

grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes in HREFs instead of double quotes, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
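If your pages mix both quote styles, a slightly wider character class should catch them in one pass (same idea, the match just stops at whichever quote comes first):

grep -oh "http://[^\"']*" *.html | sort | uniq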

If you have your html files in a bunch of subdirectories, not to fear, recursion is here!

grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options, along with "--include" and a "." starting point, so grep walks every subdirectory but only checks the .html files.

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
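If anyone wants a head start on that homework, here is a rough sketch (standard curl options; adjust the timeout to taste):

grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq | while read url; do
    # -s silences progress, -L follows redirects, -o /dev/null discards the body,
    # -w '%{http_code}' prints just the HTTP status code
    code=$(curl -s -L -o /dev/null --max-time 10 -w '%{http_code}' "$url")
    echo "$code $url"
done

Anything that comes back 000 (couldn't connect) or as a 4xx/5xx is worth a closer look.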

graeme_p

10:30 am on Jul 1, 2009 (gmt 0)

This only works if your site is static HTML; surely most sites are database-driven these days.

The KDE link checker (the one that is packaged with Quanta) has worked fine for me.

incrediBILL

1:19 pm on Jul 1, 2009 (gmt 0)

You would be amazed at the number of sites not in a database, or the smart sites that publish static pages from the database to avoid server overload during high demand.

Guess it won't help people with some blogs, but others will find it handy.
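Then again, even a dynamic site can be fed to grep if you mirror the rendered pages first, say with wget, and run the command over the local copy. A rough sketch (www.example.com is just a placeholder, and mind the bandwidth):

# mirror up to 5 levels deep, staying inside the site
wget -r -l 5 --no-parent http://www.example.com/
# wget saves everything under a directory named after the host
grep -ohr 'http://[^"]*' www.example.com | sort | uniq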

graeme_p

5:20 pm on Jul 1, 2009 (gmt 0)

OK, that makes sense, assuming that is what you are doing (rather than, for example, using memcached or a reverse proxy, unless the proxy is caching to file, I suppose).

I never really got the hang of these sorts of command lines. I find scripts easier to understand, although I guess they are harder to adapt to changing needs.