
Forum Moderators: bakedjake


Instantly List All Outbound Links

more grep fun than allowed by law

     
5:52 pm on Jun 30, 2009 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Instead of wasting time running some buggy link crawler on your site, you can get a complete list of all absolute URLs, sorted and with duplicates removed, in a single command.

grep -oh 'http://[^"]*' *.html | sort | uniq

The regex may need modification if your HTML uses single quotes instead of double quotes around HREF values, but this is about the fastest way I know to get a complete list without banging on the website with some cumbersome tool.
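As a rough sketch of that modification, the variant below stops each match at either kind of quote, at whitespace, or at an angle bracket, and also picks up https:// links. The function name list_links is made up for illustration, and the pattern is only an approximation of where a URL ends in real markup.

```shell
# Hypothetical helper: list unique absolute URLs from the given HTML files.
# Each match stops at a double quote, single quote, whitespace, or angle bracket.
list_links() {
  grep -ohE "https?://[^\"'[:space:]<>]+" "$@" | sort -u
}

# Usage: list_links *.html
```

Here sort -u does the same job as sort | uniq in one step.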

If you have your HTML files in a bunch of subdirectories, not to fear: recursion is here!

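grep -ohr --include='*.html' 'http://[^"]*' . | sort | uniq

Note that I added an "r" to the grep options so it will check all the files in subdirectories. The *.html argument is also replaced with --include='*.html' and a starting directory of ".", because the shell expands *.html only in the current directory, so grep would never reach HTML files further down. (--include is available in GNU grep.)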
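If you would rather not depend on how grep handles recursion, one alternative is to let find choose the files and hand them to grep. This is only a sketch: the function name list_links_recursive is hypothetical, and the starting directory defaults to the current one.

```shell
# Hypothetical helper: recursive variant that does not rely on grep -r.
# find selects every *.html file under the start directory (default "."),
# and grep extracts the URLs from those files.
list_links_recursive() {
  find "${1:-.}" -type f -name '*.html' -exec grep -oh 'http://[^"]*' {} + | sort -u
}

# Usage: list_links_recursive /path/to/site
```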

For a homework assignment, check to see if your links are valid using "curl" to visit each site and record the results. ;)
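One possible answer to the homework, as a minimal sketch: feed the URL list to a loop that asks curl for the final HTTP status of each link. The function name check_links and the 10-second timeout are assumptions made for illustration, not part of the original tip.

```shell
# Hypothetical helper: read one URL per line on stdin and print
# "<status> <url>" for each. A status of 000 means curl got no HTTP response.
check_links() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    # -s silent, -o /dev/null discards the body, -L follows redirects,
    # --max-time caps each request, -w prints only the final status code.
    status=$(curl -s -o /dev/null -L --max-time 10 -w '%{http_code}' "$url" </dev/null)
    printf '%s %s\n' "$status" "$url"
  done
}

# Usage:
#   grep -oh 'http://[^"]*' *.html | sort -u | check_links > results.txt
```

Anything in results.txt that does not start with 200 is worth a second look. Keep in mind that this fetches every link in full, so it can be slow on a long list.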

10:30 am on July 1, 2009 (gmt 0)

Senior Member from LK 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2417
votes: 17


This only works if your site is static HTML - surely most sites are in databases these days.

The KDE link checker (the one that is packaged with Quanta) has worked fine for me.

1:19 pm on July 1, 2009 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


You would be amazed at the number of sites not in a database, or the smart sites that publish static pages from the database to avoid server overload during high demand.

Guess it won't help people with some blogs but others will find it handy.

5:20 pm on July 1, 2009 (gmt 0)

Senior Member from LK 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2417
votes: 17


OK, that makes sense, assuming that is what you are doing (rather than, for example, using memcached or a reverse proxy, unless the proxy is caching to file, I suppose?).

I never really got the hang of these sorts of command lines. I find scripts easier to understand, although I guess that they are harder to adapt to changing needs.

 
