incrediBILL - 12:30 am on Feb 22, 2007 (gmt 0)
I know this isn't just a problem I was having, as I checked a few of my competitors and their sites were (and still are) filthy with bad links. With tens of thousands of links in a directory, manually re-checking each link one at a time is out of the question.

The solution? Write your own custom link checker on steroids.

We're talking about getting into the guts of the HTTP protocol here and looking at headers, because you need to be in the loop on every little detail, including intermediate redirect locations and much more, in order to effectively identify bad listings and truly clean up your directory. Don't panic, as this isn't as complicated as it sounds, and you Linux people can easily use cURL to get a page complete with headers to examine.

The biggest challenge is processing redirects one at a time as opposed to letting cURL follow the redirects itself, which is an option. Sometimes you get text back with a redirect, and that text can be useful. If you let cURL follow the redirects, it returns all the intermediate headers but only the final page content, so you miss those intermediate messages. However, when you run into ASPSESSION cookies, letting cURL follow redirects might save you a few hours of frustration and pulled-out hair.

Some things to check for:

1. Soft 404s, or redirected 404s that ultimately return a "200 OK". Some servers still return the good old hard 404 error, but more often than not, especially with free hosting or servers managed with control panels, you don't get hard 404s. This means you have to build up a large list of fingerprints from what is returned in the page content, titles and HTTP headers (in redirects) to identify these soft 404s.

2. Redirects to landing pages, a key to identifying many domain parks. Domain parks tend to process all the domains they control centrally and will ultimately redirect to the final destination after passing through some easily identifiable locations that are unique to those types of operations. Keep in mind that not all landing pages are permanent (or bad), as domain registrars will temporarily park sites being moved, so those should be put on hold.

3. Check the resulting page text for fingerprints that identify domain parks, scrapers, MFA pages, sites that are INFECTED with malware and more. This is the complicated part, as some things that initially look like obvious fingerprints turn out to generate false positives and flag good sites. Needless to say, you have to use some caution building this list and spot check the sites that return a positive for each fingerprint before relying on it blindly.

4. Endless redirects. I stop at 10. Some sites get temporarily broken by people monkeying with their .htaccess files, and the redirects loop out of control. Maybe they will eventually fix the problem, but until they do, put the site on hold. As a fail-safe, just to make sure my code isn't flawed, I pass sites that loop to cURL with the option to follow the redirects and see if cURL can actually get to a page. If cURL fails as well, the site is put on hold.

5. Timeouts: sites that flat don't respond but are still registered domains located on an actual server. I don't care why a site isn't responding; it's put on hold until they get it together and start serving pages.

6. Unregistered sites. If you get an error, check the WHOIS and see if the domain is no longer registered, which is a good clue you can dump it: it's not even in renewal, it's just gone.

FWIW, if you have a fairly large directory you will build hundreds of fingerprints to flag bad sites in short order.

Now that you have a super duper link checker, what do you do with the sites you find that are broken? Here's what I do with the sites that fail the link checker and are put on hold:

A. Sites that fail the link checker are grouped by type of failure and quarantined for later review. These sites are spot checked to make sure the link checker isn't getting false positives.

B. Re-scan all quarantined sites periodically, every couple of weeks or monthly, and put sites that respond as active into a manual review queue.

C. Review sites flagged for re-inclusion manually before releasing them, to make sure that the domain park or whatever didn't just change its fingerprint, as this happens all the time, and to make sure sites flagged with malware have been truly cleaned up.

D. 90-day cut-off. Anything held in quarantine for more than 90 days that hasn't become active again, or that still has malware, is unceremoniously dumped.

That's the basic idea of what I'm doing: mostly automated, with some human review.

Building a custom link checker and fingerprinting pages may sound like a lot of work, but in the end it pays off massively. You can clean your directory automatically and effortlessly, over and over again, and in the process make sure your visitors have an excellent experience on your site.

The best part is your competitors will still serve up bad pages, and as visitor frustration mounts, visitors will abandon those sites and your traffic will increase.
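For anyone who wants to see the "one redirect at a time" idea in code, here is a rough sketch in Python rather than cURL. It is not the checker described above. The names `fetch_once` and `trace`, the verdict strings and the 64 KB read limit are all made up for illustration. It keeps the text returned with every hop, stops after 10 redirects, and treats timeouts and connection errors as "put on hold".

```python
import http.client
from urllib.parse import urljoin, urlsplit

MAX_HOPS = 10  # the post stops at 10 redirects
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url, timeout=15):
    """Fetch one URL WITHOUT following redirects; return (status, headers, body)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": "linkcheck-sketch"})
        resp = conn.getresponse()
        headers = {name.lower(): value for name, value in resp.getheaders()}
        # Read a bounded amount; enough to fingerprint titles and messages.
        body = resp.read(65536).decode("utf-8", "replace")
        return resp.status, headers, body
    finally:
        conn.close()


def trace(url, fetch=fetch_once, max_hops=MAX_HOPS):
    """Follow redirects hop by hop, keeping every intermediate response.

    Returns (verdict, hops) where hops is a list of (url, status, text).
    The verdict strings are hypothetical labels for this sketch.
    """
    hops = []
    for _ in range(max_hops + 1):
        try:
            status, headers, body = fetch(url)
        except (OSError, http.client.HTTPException) as exc:
            # Timeouts, refused connections and DNS failures all land here.
            hops.append((url, None, str(exc)))
            return "hold:no-response", hops
        hops.append((url, status, body))
        location = headers.get("location")
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        verdict = "ok" if status == 200 else "hold:http-%d" % status
        return verdict, hops
    return "hold:redirect-loop", hops
```

Calling `trace("http://example.com/")` would return the verdict plus the full list of hops, so both the intermediate locations and any text sent with each redirect are there to fingerprint. The `fetch` parameter exists only so the loop can be exercised without touching the network.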
The problem I've been running into for quite some time now is that thousands of sites, both old and new, succumb to the domain park crowd and worse. One day you have a link to "XYZ Plumbing" and the next day it's either a domain park site, porn redirect, or <shudders> a ringtone site.
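To make the fingerprinting idea a bit more concrete, here is a small Python sketch of how a fingerprint list might be matched against the redirect chain and the final page. Every pattern below is a hypothetical example, not one of my real fingerprints, and `classify` is just a name picked for this illustration. A real list comes from looking at the pages your own checker turns up.

```python
import re

# Hypothetical fingerprints, for illustration only. Each entry is
# (failure type, field to search, pattern).
FINGERPRINTS = [
    ("soft-404", "title", re.compile(r"page not found|404", re.I)),
    ("soft-404", "body",
     re.compile(r"the page you requested (could not|cannot) be found", re.I)),
    ("domain-park", "body", re.compile(r"this domain (is|may be) for sale", re.I)),
    ("domain-park", "url", re.compile(r"//[^/]*parking\.", re.I)),
]

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)


def classify(urls, body):
    """Return the sorted failure types matched by the chain or final page.

    urls is every location visited (original URL plus redirects);
    body is the text of the final page.
    """
    match = TITLE_RE.search(body)
    fields = {
        "title": match.group(1) if match else "",
        "body": body,
        "url": "\n".join(urls),
    }
    return sorted({kind for kind, field, pattern in FINGERPRINTS
                   if pattern.search(fields[field])})
```

A clean page comes back as an empty list, while a chain that passes through a host like `parking.registrar.example` is flagged as `domain-park`. Note that the deliberately sloppy `404` title pattern would also flag a page titled "Peugeot 404 Owners Club", which is exactly the kind of false positive that spot checking each fingerprint is meant to catch.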