Forum Moderators: Robert Charlton & goodroi


Software that determines whether Links return 404s


MichaelW

9:34 am on Nov 28, 2014 (gmt 0)

10+ Year Member



Is there software that will check all the links into a site, and determine whether these return a 404?

thanks.

aakk9999

4:20 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, there are numerous tools for this. We have had threads on tools before, and you can find references to them, and descriptions of what they do, in these threads:

2013 Favourite SEO Tools [webmasterworld.com] - the first tool listed in this thread (Screaming Frog) will do what you want, but the free version crawls only the first 500 URLs.

Favourite SEO Tools [webmasterworld.com] - in this thread, the last tool that tedster mentions (AuditMyPc) will also do what you want, and the last time I looked (a few years back) it was free.

There could be other tools listed in these two threads that will do what you want, but the two mentioned above are the ones I have used myself in the past.

MichaelW

4:32 pm on Nov 28, 2014 (gmt 0)

10+ Year Member



Hi, I think what I meant to ask is: what software will gather all the inbound links, collect their target URLs, and then check whether those target URLs are 404s or not?

I'm not sure Screaming Frog does this.

I'm guessing I could get all the target URLs from Majestic and then run them through list mode in Screaming Frog to check whether they are 404s, though I was looking for something that does all the steps in one process.

thanks

netmeg

4:45 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you saying you're trying to find which of your incoming links are 404ing? You can probably parse that out of your log file. But yes, you can also feed Screaming Frog a file full of URLs and it will check for 404s.

aakk9999

4:50 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah, understand now. As netmeg says, your server logs would be the best bet.

I presume you have already taken the list of URLs from WMT errors section? (although these will be in your server logs too)

netmeg

8:00 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm presuming he wants to find any incoming link that 404s, and not just the ones that people actually try to come in on.

lucy24

8:43 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



not just the ones that people actually try to come in on

But that goes beyond a tool you can run yourself, and into the realm of (paid) services. Which in turn gets into "a service is only as good as its robot". If GreatSite links to you, but they've blocked UsefulToolBot for whatever reason, then UsefulTool will never be able to tell you if GreatSite's links are valid.

In practice, though, the links people are actually using are the ones you really need to know about. That's where you open up your raw logs in a text editor and search with an appropriate regular expression, such as this one (for Apache):
(GET|HEAD) \S+ HTTP/1\.[01]" 4(04|10) \d+ "http

This will also bring up any 404s whose referer happens to be your own site, but that's just as well, since you would certainly want to know about those! I'd include both GET and HEAD, because a HEAD request may be someone else's link checker at work, and those are the ones most likely to respond to a "can you please fix this?" request.

As long as you're in there, you might throw 301 into the mix to pick up the valid-but-imperfect links.
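If hand-searching in a text editor gets tedious, the same expression can be scripted. A minimal sketch in Python, assuming the common Apache combined log format; the sample line here is made up for illustration:

```python
import re

# Find 404/410 responses whose referer is an external http(s) URL,
# per the regular expression quoted above.
pattern = re.compile(
    r'"(?:GET|HEAD) (\S+) HTTP/1\.[01]" 4(?:04|10) \d+ "(http[^"]*)"'
)

def broken_inbound(lines):
    """Yield (requested path, referer) for each matching log line."""
    for line in lines:
        m = pattern.search(line)
        if m:
            yield m.group(1), m.group(2)

# Illustrative combined-log-format line:
sample = ('1.2.3.4 - - [28/Nov/2014:09:34:00 +0000] '
          '"GET /old-page HTTP/1.1" 404 512 '
          '"http://example.com/links" "Mozilla/5.0"')
hits = list(broken_inbound([sample]))
```

In real use you would feed it each log file's lines (for instance via `fileinput.input()` over a month of logs) rather than a single sample string.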

[edited by: aakk9999 at 12:58 am (utc) on Nov 30, 2014]

MichaelW

10:17 pm on Nov 28, 2014 (gmt 0)

10+ Year Member



Hmm, looking through the log files seems time-consuming.

I was thinking of digging around for the odd decent link that gets minimal traffic and might be hard to find in the log files.

Majestic and Screaming Frog may be the best option.

thank you.

lucy24

10:33 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



looking through the log files seems time-consuming

Not unless you've got so many sites, and they are all so heavily trafficked, that the mere act of doing a multi-file search will tie up your computer for hours. It took me about 30 seconds in TextWrangler (a few hundred, maybe 1-2000, small files, including zipped archives). I'm pretty sure the number of separate files is a much bigger factor than the size of individual files, so if you constrain it to just the past month or so, the time investment pretty well disappears.

That's assuming you keep the raw logs somewhere. You should, just as a matter of habit, even if you don't normally do anything with them.

MichaelW

10:36 pm on Nov 28, 2014 (gmt 0)

10+ Year Member



OK Lucy24, great, thanks for this advice. I'll look at TextWrangler as well.

Kendo

11:23 pm on Nov 28, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have been using XENU for eons.

Clay_More

11:52 pm on Nov 28, 2014 (gmt 0)

10+ Year Member



I also use XENU, but it will only check links that are part of your site or link out from your site. To catch the odd inbound link to a page that does not exist, I'll usually look at the logs.

Added: I also have a page set to noindex that has only one in-content outbound link, to my home page. I'll often redirect malformed inbound links to pages that do not exist to that noindex page.

lucy24

4:13 am on Nov 29, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll often redirect malformed inbound links to pages that do not exist to that noindex page.

You mean, just as a simple way of keeping track (because a 301 preserves the original referer)? Makes sense. But is there any way to distinguish between honest erroneous links, and spurious robotic requests that slap on a referer in hopes of getting past some barriers? I mean, other than excluding requests for /wp-admin/ and similar.

Clay_More

6:13 pm on Nov 29, 2014 (gmt 0)

10+ Year Member



The page was originally set up for bots. I don't have much of an issue with www vs non-www on this site, but a lot of bot traffic was coming in to example.com pages where the canonical was www.example.com.

I removed the .htaccess section setting the canonical and instead redirected to that page, with its one link to the canonical home page, for any stray humans. I've found it helpful.
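For illustration, a rough .htaccess sketch of the kind of rule being described, assuming Apache with mod_rewrite enabled; the /link-catch.html filename is hypothetical:

```apache
# If the requested file or directory does not exist, send the visitor
# to a noindex "catch" page instead of serving a bare 404.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /link-catch.html [R=302,L]
```

Because the catch page itself exists, the `!-f` condition fails for it and the rule cannot loop.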

MichaelW

6:14 am on Nov 30, 2014 (gmt 0)

10+ Year Member



OK, using Majestic to find the target URLs, OpenOffice to filter out the duplicates, and then running the text file of results through Screaming Frog works grand.
I do have to cross-reference the 404s with the filtered results and the 'Source_URL' column in OpenOffice.

It would be nice to have software that did it all in one step.
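For what it's worth, the de-duplicate-and-check part of that workflow can be collapsed into one short script. A sketch using only Python's standard library, assuming the Majestic export can be reduced to one target URL per line; `check_targets` and `targets.txt` are names made up here:

```python
import urllib.request
import urllib.error

def check_targets(urls):
    """Return {url: HTTP status} for a de-duplicated set of target URLs.

    404s surface as 404; unreachable hosts map to None."""
    results = {}
    for url in sorted(set(urls)):  # de-duplicate up front
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                results[url] = resp.status
        except urllib.error.HTTPError as err:
            results[url] = err.code   # 4xx/5xx responses raise HTTPError
        except urllib.error.URLError:
            results[url] = None       # DNS failure, connection refused, etc.
    return results

# Typical use: read the exported list, then report the 404s:
#   with open("targets.txt") as f:
#       statuses = check_targets(line.strip() for line in f if line.strip())
#   for url, status in statuses.items():
#       if status == 404:
#           print(url)
```

A HEAD request keeps the check lightweight, though a few servers mishandle HEAD; switching the request method to GET is the usual fallback.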