I know this isn't just a problem I was having as I checked a few of my competitors and their sites were (and still are) filthy with bad links.
With tens of thousands of links in a directory, manually re-checking each link one at a time is out of the question.
Write your own custom link checker on steroids.
We're talking about getting into the guts of the HTTP protocol here, looking at headers, as you need to be in the loop on every little detail, including intermediate redirect locations, and much more in order to effectively identify bad listings and truly clean up your directory.
Don't panic, as this isn't as complicated as it sounds, and you Linux people can easily use CURL to get a page complete with headers to examine. The biggest challenge is processing redirects one at a time, as opposed to letting CURL follow the redirects itself, which is an option. Sometimes you get text back with the redirect, which can be useful; if CURL follows the redirects for you, it returns all the intermediate headers but only the final page content, so you miss those intermediate messages. However, when you run into ASPSESSION cookies, letting CURL follow redirects might save you a few hours of frustration and pulled-out hair.
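For anyone not using CURL, here is a minimal Python sketch of stepping through redirects one hop at a time so that each intermediate status, header set and body is kept. The names `fetch_once` and `trace_redirects` are made up for illustration, and the 10-hop limit is just the cut-off mentioned later in this post; this is not the original checker's code.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect;
        # the 3xx response then surfaces as an HTTPError.
        return None


def fetch_once(url, timeout=15):
    """Fetch one URL without following redirects.

    Returns (status, headers, body) with header names lowercased.
    Network failures (timeouts, DNS errors) are left to propagate.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            headers = {k.lower(): v for k, v in resp.headers.items()}
            return resp.status, headers, resp.read()
    except urllib.error.HTTPError as err:
        headers = {k.lower(): v for k, v in err.headers.items()}
        return err.code, headers, err.read()


def trace_redirects(url, fetch=fetch_once, max_hops=10):
    """Follow redirects by hand, recording every hop.

    Returns (hops, looped): hops is a list of (url, status, body)
    and looped is True if the chain was still redirecting after
    max_hops redirects had been followed.
    """
    hops = []
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        hops.append((url, status, body))
        location = headers.get("location")
        if status not in REDIRECT_CODES or not location:
            return hops, False
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops, True
```

Passing a different `fetch` function makes it easy to replay saved responses instead of hitting the live site.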
Some things to check for:
1. Soft 404s or redirected 404s that ultimately return a "200 OK". Some servers still return the good old hard 404 error but more often than not, especially with free hosting or some servers managed with control panels, you don't get hard 404s. This means you have to build up a large list of finger prints from what is returned in the page content, titles and HTTP headers (in redirects) to identify these soft 404s.
2. Redirects to landing pages, a key to identifying many domain parks. Domain parks tend to have a centralized processing of all the domains they control and will ultimately redirect to the final destination after passing thru some easily identifiable locations that are unique to those types of operations. Keep in mind that not all landing pages are permanent (or bad) as domain registrars will temporarily park sites being moved, so those should be put on hold.
3. Check the resulting text page for finger prints that identify domain parks, scrapers, MFA pages, sites that are INFECTED with malware and more. This is the complicated part as some things that initially look like obvious finger prints turn out to generate false positives and flag good sites. Needless to say you have to use some caution building this list and spot check sites that return a positive for each finger print before relying on it blindly.
4. Endless redirects, I stop at 10. Some sites get temporarily broken by people monkeying with their .htaccess files that loop out of control. Maybe they will eventually fix the problem, but until they do, put it on hold. As a fail safe, just to make sure my code isn't flawed, I pass sites that loop to CURL with the option to follow the redirects and see if CURL can actually get to a page. If CURL fails as well it's put on hold.
5. Timeouts, sites that flat don't respond but are still registered domains located on an actual server. Don't care why it isn't responding, but it's put on hold until they get it together and start serving pages.
6. Unregistered sites. If you get an error check the WHOIS and see if it's no longer registered which is a good clue you can dump it as it's not even in renewal, it's just gone.
FWIW, if you have a fairly large directory you will build hundreds of finger prints to flag bad sites in short order.
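To make the finger print idea concrete, here is a small Python sketch that matches a list of patterns against the page title, the page body and the redirect URLs seen along the way. The three patterns are invented placeholders, not real finger prints from any directory, and `classify` is a hypothetical name.

```python
import re

# (label, field to search, pattern). These entries are illustrative
# assumptions only; a real list is built up from pages actually seen.
FINGERPRINTS = [
    # A bare "404" in a title can flag good sites, which is exactly
    # the false-positive risk described above, so spot check it.
    ("soft_404", "title", re.compile(r"page not found|404", re.I)),
    ("domain_park", "body", re.compile(r"this domain (is|may be) for sale", re.I)),
    ("domain_park", "redirect", re.compile(r"/parking/|parked", re.I)),
]

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)


def classify(body, redirect_urls=()):
    """Return a list of (label, pattern) for every finger print that hits.

    body is the page text as a str; redirect_urls are the intermediate
    locations collected while following redirects.
    """
    match = TITLE_RE.search(body)
    fields = {
        "title": match.group(1).strip() if match else "",
        "body": body,
        "redirect": "\n".join(redirect_urls),
    }
    return [
        (label, pattern.pattern)
        for label, field, pattern in FINGERPRINTS
        if pattern.search(fields[field])
    ]
```

Returning the pattern along with the label makes it possible to group hits by finger print and spot check each one before trusting it.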
Now that you have a super duper link checker, what do you do with the sites you find that are broken?
Here's what I do with the sites that fail the link checker and are put on hold:
A. Sites that fail the link checker are grouped by type of failure and quarantined for later review. These sites are spot checked to make sure the link checker isn't getting false positives.
B. Re-scan all quarantined sites periodically, every couple of weeks or monthly, and put sites that respond as active into a manual review queue.
C. Review sites flagged for re-inclusion manually before releasing to make sure that the domain park or whatever didn't just change their finger print as this happens all the time, and to make sure sites flagged with malware have been truly cleaned up.
D. 90-day cut-off. Anything held in quarantine for more than 90 days that hasn't become active again, or whose malware hasn't been removed, is unceremoniously dumped.
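Steps B through D boil down to a small decision rule. A rough Python sketch follows; the function name, the action labels and the two boolean inputs are assumptions made for illustration, and only the 90-day limit comes from the post.

```python
from datetime import date, timedelta

HOLD_LIMIT = timedelta(days=90)


def next_action(quarantined_on, today, responds_now, still_flagged):
    """Decide what to do with one quarantined listing after a re-scan.

    responds_now:  the re-scan got a live page back.
    still_flagged: a finger print (domain park, malware, ...) still matches.
    """
    if responds_now and not still_flagged:
        # Never released automatically; a human looks at it first.
        return "manual_review"
    if today - quarantined_on > HOLD_LIMIT:
        return "drop"
    return "keep_on_hold"
```

For example, `next_action(date(2007, 1, 1), date(2007, 4, 15), False, False)` gives `"drop"` because the listing has been held for more than 90 days without coming back.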
That's the basic idea of what I'm doing, mostly automated with some human review.
Building a custom link checker and finger printing pages may sound like a lot of work but in the end it pays off massively. You can clean your directory automatically and effortlessly, over and over again, and in the process make sure your visitors have an excellent experience on your site.
The best part is your competitors will still serve up bad pages and as visitor frustration mounts they will abandon those sites and your traffic will increase.
That right there is the big reason not to submit to thousands of directories. It's not worth the chance of tanking a site. Honestly, you have to do a reality check and see how much traffic you are getting from the directory.
The only ones we have ever submitted to are the DMOZ and Yahoo directories. Besides those we stay away from them because you never know what the owner of the domain will do with it.
Not only do you have to worry about the hijack, you also have to worry about the directory owner blackhatting or spamming as well, which can definitely hurt sites that are linked from the directory.
I guess the lesson to be learned is to be very picky about the directories you submit to. Nowadays links from directories are not necessary to obtain good SERPs.
A page on your site providing information about your user-agent is a great idea, too.
For a User-agent, follow the major search engines' approach:
Mozilla/5.0 (compatible; DirectoryLinkChecker/1.0; +http://www.MyDirectory.com/linkchecker.html)
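With command-line curl the string can be set with the `-A` / `--user-agent` option. In Python's standard library it is one header on the request; a minimal sketch, where `build_request` is a made-up helper name and the User-agent is the example string above:

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (compatible; DirectoryLinkChecker/1.0; "
    "+http://www.MyDirectory.com/linkchecker.html)"
)


def build_request(url):
    """Build a request that identifies the link checker to the target site."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The returned object can be passed to `urllib.request.urlopen` or to an opener's `open` method in place of a bare URL.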
Not only does the OP encourage submitting to directories, he encourages reciprocal linking with some of them as long as they are still a directory.
What are the chances that the domain/link in question changed hosting companies, or that the hosting company moved the domain to a different IP?
I know some hosting companies would BLOCK/DENY PING, but the chances are very slim..
The OP is only talking about reciprocal links.
No, the OP is talking about checking the links SUBMITTED to my directory. If you don't validate the listings in your directory they turn into garbage over time, sometimes in less than 30 days. I'm often surprised that a professional web site, one that someone obviously paid a lot of money to have developed, is lost in short order due to someone missing the domain renewal.
However, you could check reciprocals or any other big link list with the same technology to make sure those links are in good order as well. Links are Links!
Be aware that using tools/library routines like CURL without changing the default User-agent name to something that gives a clue about who is doing the checking may get you banned from many sites.
Jim is correct on this point that I forgot to mention, that the default CURL user agent is blocked in a few places which will result in a false positive as being broken. Not as many sites as you might think blocked CURL as I was only bounced out of a few sites, but I did change the user agent in order to check those links and avoid accidentally marking them as bad.
Remember, I'm not crawling an entire website submitted to my directory, I'm just checking the entry page, usually the index page.
CACHE THE PAGES - If you CACHE all of the pages when you make a full pass validating your directory you can then test various new finger prints rapidly. Using the cached pages on your local hard disk instead of going back and loading each page from each individual web server can help to rapidly test new code. However, I wouldn't rely on cached pages for more than 24 hours for this purpose.
RUN A SEARCH SITE? What I'm describing isn't just useful for a directory as search engines do similar things to detect broken sites in their index. A conversation I had with an engineer from Live Search confirmed that they do all the soft 404 checking as well. I was looking for some additional pointers on domain park detection and the guy suddenly got really tight lipped as it appears they see that info as highly confidential, as they don't want the domain parks to figure out how they're being identified.
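A simple way to do this kind of caching, sketched in Python: store each page body under a hash of its URL and refuse to return anything older than 24 hours. The function names and the one-file-per-URL layout are assumptions for illustration, not a description of the poster's setup.

```python
import hashlib
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour limit suggested above


def cache_path(cache_dir, url):
    """Map a URL to a file name that is safe on any filesystem."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def save_page(cache_dir, url, body):
    """Write the raw page bytes to the cache."""
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)


def load_fresh(cache_dir, url, now=None):
    """Return the cached bytes, or None if missing or older than 24 hours."""
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > MAX_AGE_SECONDS:
        return None
    return path.read_bytes()
```

A new finger print can then be run over every page `load_fresh` returns without making a single request to the listed sites.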
[edited by: incrediBILL at 4:42 pm (utc) on Feb. 23, 2007]
This is one of the main reasons to properly identify your directory crawler. It is commonly understood that directories need to re-validate submitted links, and that since there is no actual crawling going on, a request for robots.txt will probably *not* be made. If the targeted site's Webmaster sees a link to a familiar directory in the user-agent string, or can follow the link and be reminded of submitting to that directory, then he/she is less likely to get trigger-happy and ban the user-agent.
Legitimate directory administrators and search engines alike will do well to keep in mind that there is an awful lot of abuse going on, and Web sites are increasingly running with "shields up" to protect themselves from all the scraping and harvesting going on these days. Using a meaningful, informative, and syntactically-correct user-agent string is not only the polite thing to do, it's also a matter of keeping directory/search engine listings comprehensive by making sure the user-agent doesn't get shown to the door when fetching pages.
1. Run Xenu at the beginning of the month
2. Run a Perl script on an outside server in the middle of the month that downloads the home page of each site. It logs any HTML refreshes and the page title. Every 6 months I have my father-in-law look at questionable or blank titles.
3. Every 18 months or so my father in law looks at each site
4. We have a phone call to each company every 24-36 months. Again my father in law. A great way to write him a check he feels good about.
Xenu shows up sites not responding and 301 and 302s.
The Perl script shows up meta-refreshes to new domains and NetSol expired domains and other probable problems indicated by the page title.
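The Perl script itself isn't shown in the thread. As a rough illustration of the same idea in Python, this sketch pulls the page title and any meta-refresh target out of a downloaded home page using only the standard library; the class and function names are invented for the example.

```python
import re
from html.parser import HTMLParser

# Matches the "url=..." part of e.g. content="0; URL=http://new.example/"
REFRESH_URL_RE = re.compile(r"url\s*=\s*['\"]?([^'\"\s;]+)", re.I)


class HomePageScan(HTMLParser):
    """Collects the <title> text and meta-refresh targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.refresh_targets = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "refresh":
                found = REFRESH_URL_RE.search(attrs.get("content") or "")
                if found:
                    self.refresh_targets.append(found.group(1))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan(html):
    """Return (title, refresh_targets) for a page given as a str."""
    parser = HomePageScan()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.refresh_targets
```

Blank titles and refresh targets on a different domain are the cases worth logging for a human to look at.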
What this misses:
2. Companies no longer in existence where the ISP or web designer still has their website up.
At the rate my new crawler is going, I should be able to get through all 700,000 URLs within 3 weeks even running in single-threaded mode.
I don't cache the pages, but I do save the host IP, which is a good indicator if a site has gone through a major change. It also makes it easy to detect parked pages since many domain parking services load up their servers with tens of thousands of domains on the same IP.
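Both uses of the saved host IP are easy to sketch in Python: grouping listings by IP to find crowded addresses, and comparing two crawls to see which hosts moved. The function names and the threshold are assumptions, and since ordinary shared hosting also puts many domains on one IP, a crowded address is a hint to look closer rather than proof of parking.

```python
from collections import defaultdict


def crowded_ips(host_to_ip, threshold=1000):
    """Return {ip: [hosts]} for IPs shared by at least `threshold` listings.

    Only the directory's own listings are counted, so the threshold
    (an arbitrary assumption here) has to be tuned to the directory size.
    """
    by_ip = defaultdict(list)
    for host, ip in host_to_ip.items():
        by_ip[ip].append(host)
    return {
        ip: sorted(hosts)
        for ip, hosts in by_ip.items()
        if len(hosts) >= threshold
    }


def changed_ips(previous, current):
    """Return hosts whose IP differs between two crawls."""
    return sorted(
        host
        for host in current
        if host in previous and previous[host] != current[host]
    )
```

A host that shows up in `changed_ips` and lands on an address from `crowded_ips` is a good candidate for a closer look.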
After my first re-crawl is complete I plan on adding a function which just checks DNS records to see if any of the domains in the listings have expired or have been parked. I think I can check the DNS of all my listings in a few hours.
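A DNS-only pass like that could look something like this in Python. It only asks whether the name still resolves and to which addresses; a name that fails to resolve is not necessarily expired, so treat it as a reason to check further. `resolve` is a hypothetical helper, and the `resolver` parameter exists only so saved or fake results can be substituted.

```python
import socket


def resolve(host, resolver=socket.getaddrinfo):
    """Return the sorted list of IP addresses for host, or [] if it
    does not resolve."""
    try:
        infos = resolver(host, 80, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name lookup failed: possibly expired, possibly a DNS hiccup.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Comparing the result with the host IP saved from the last crawl shows which domains have moved, for example onto an address already known to belong to a parking service.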
Thanks for sharing, Bill!