|Is there something like Xenu, for mac?|
| 9:27 pm on Dec 15, 2011 (gmt 0)|
I'm looking for a decent link checker. Something that will crawl an entire site, and report back with URLs found, HTTP status, and whatnot.
Something just like Xenu would be perfect. I need only a few options, like the ability to not follow external links and to exclude files with a regex.
| 9:28 pm on Dec 15, 2011 (gmt 0)|
Oh, there's one called Integrity.
The last time I tried that one it crashed the MacBook. I'm not kidding.
| 9:41 pm on Dec 15, 2011 (gmt 0)|
I have used Integrity to check a 10K+ page website with no problems; you may have run into a memory issue.
I've also used BLT, and there is another called DeepTrawl.
| 11:24 pm on Dec 15, 2011 (gmt 0)|
A few years ago I was forced to install the w3c link checker locally because it was getting snarky about checking fragments in multiple files. I don't know how well you get along with Terminal and command-line input. It was horribly traumatic for me, but you only have to do it once.* The drawback is that by default the local version thinks your dtd is a link. (Someone told me how to override it, but my brain tends to shut down when I'm given command-line information.) The CGI part is separate.
* Except when they go and update it, leaving me no choice but to, uhm, ignore the update.
| 11:38 am on Dec 19, 2011 (gmt 0)|
Follow-up: OK, I gritted my teeth and read the manual. Well, the Help screen. I never knew it existed until I tried to use a command someone else told me about. Oops.
W3C Link Checker, installed locally, command-line interface. For a whole site:
checklink -X http://www.w3.org/TR/html4/loose.dtd -l http://www.example.com/ http://www.example.com/
-X = exclude. For html4/loose etc, substitute whatever your own DTD says. You can use a regex. Pile on further -X as needed. You have to include the DTD line because the local version doesn't ignore it the way the online version does, which means that every single page on your site will wait an extra 15 seconds before spitting out a 500 error.
-l (that's ell, not Eye) = constrain recursive searches to this location (here a whole domain)
The repetition of www.example.com is not a typo. The first one goes with -l. The second one is the actual page you're checking. By specifying a location you've told it to keep recursing, with no depth limit, through every page inside that location.
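If you end up piling on several -X patterns, it may be easier to let a small script assemble the command than to retype it. A rough sketch in Python, using only the -X and -l options described above; the function name build_checklink_command and the example.com URLs are placeholders of mine, not part of the link checker:

```python
import shlex


def build_checklink_command(start_url, scope_url, exclude_patterns):
    """Assemble the argument list for a recursive checklink run.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    args = ["checklink"]
    # One -X per URL or regex to exclude (the DTD goes here).
    for pattern in exclude_patterns:
        args += ["-X", pattern]
    # -l constrains the recursion; the last argument is the page to start from.
    args += ["-l", scope_url, start_url]
    return args


command = build_checklink_command(
    start_url="http://www.example.com/",
    scope_url="http://www.example.com/",
    exclude_patterns=["http://www.w3.org/TR/html4/loose.dtd"],
)

# shlex.join quotes anything the shell would otherwise mangle (regexes, say).
print(shlex.join(command))
```

To actually run it, something like subprocess.run(command) should do, assuming checklink is on your PATH.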
Once it has started, go out to dinner. Or leave on vacation, depending on how big your site is. The Link Checker is a Good Robot, so its default minimum time between requests is 1 second. You can make it longer but not shorter. My site took about 45 minutes. Oh, and don't forget to say
in your robots.txt. I did say it's a Good Robot ;)
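For anyone wondering whether it's dinner or vacation: the delay between requests gives you a floor on the running time. A back-of-envelope sketch in Python, assuming every request waits the full delay and ignoring how long the server takes to answer; both function names are made up for illustration:

```python
def min_crawl_minutes(request_count, delay_seconds=1.0):
    """Lower bound on running time if each request waits delay_seconds."""
    return request_count * delay_seconds / 60


def max_requests(minutes, delay_seconds=1.0):
    """Upper bound on how many requests fit in a run of the given length."""
    return int(minutes * 60 / delay_seconds)


# A 45-minute run at the default 1-second delay
print(max_requests(45))          # 2700 requests at most
# A hypothetical site with 10,000 links to check
print(min_crawl_minutes(10000))  # about 167 minutes at the very least
```

So my 45 minutes works out to no more than about 2,700 requests, and a site with ten thousand links would need close to three hours before you even count response times.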