This builds on another recent thread and I thought I would show how I maintain a directory of 14,000 links in a specific vertical market.
1. All of our links (along with company name, address, and phone) are in a FileMaker database. If you do not have such a database, you could use Xenu to generate a list of valid links, or otherwise extract them from your website.
2. At the beginning of each month I run Xenu against a local copy of our website. This picks up non-responding URLs and URLs returning 403, 404, 301, and 302 status codes. It does not identify HTML refreshes, nor websites that have expired and become an MFA or other junk advertising site.
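Since a status-code checker reports such pages as a healthy 200, the refresh has to be found in the page body itself. Here is a minimal sketch of that idea in Python (the post's actual script is Perl and is not shown); the class and function names are my own, illustrative choices:

```python
# Sketch: detect <meta http-equiv="refresh"> redirects that a
# status-code-only link checker would miss. Stdlib only.
import re
from html.parser import HTMLParser

class MetaRefreshParser(HTMLParser):
    """Records the target URL of any <meta http-equiv="refresh"> tag."""
    def __init__(self):
        super().__init__()
        self.refresh_target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("http-equiv", "").lower() != "refresh":
            return
        # content typically looks like "0; url=http://example.com/new/"
        m = re.search(r"url\s*=\s*(\S+)", attrs.get("content", ""),
                      re.IGNORECASE)
        if m:
            self.refresh_target = m.group(1).strip("'\"")

def find_meta_refresh(html):
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.refresh_target
```

Run against each fetched page, a non-empty result marks a URL to review even though the server answered 200.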
3. In the middle of the month, to find the problem sites, I run a Perl script I wrote on our Unix server using:
The script returns two files, which I open in a simple database. The first file contains every URL along with its HTML refresh target, if one exists; there I look for cross-domain transfers done by web designers who did not know how to set up a proper 301 server redirect when moving a domain. The second file contains each page title and URL. Pulling this into the simple database, I scan the list for duplicate titles and obvious problems, such as expired websites and other typical trouble indicators. I tag the record for every blank title, "index" title, etc.
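The title-scan step can be automated too. This is a small Python sketch of the idea, assuming (url, title) pairs like those in the second output file; the flag labels and sample heuristics are illustrative, not taken from the post:

```python
# Sketch: flag URLs whose page titles suggest trouble, given
# (url, title) pairs. Blank titles, default directory-index titles,
# and duplicated titles are common signs of expired or parked domains.
from collections import Counter

def flag_titles(records):
    """records: iterable of (url, title); returns list of (url, reason)."""
    counts = Counter(title.strip().lower() for _, title in records)
    flagged = []
    for url, title in records:
        t = title.strip()
        if not t:
            flagged.append((url, "blank title"))
        elif t.lower().startswith("index of"):
            flagged.append((url, "directory index title"))
        elif counts[t.lower()] > 1:
            flagged.append((url, "duplicate title"))
    return flagged
```

Feeding it the second output file yields the same kind of tagged list of questionable URLs the post describes building by eye.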
4. I export this tagged list of questionable URLs to a file and take a snapshot of each URL using Snapshotter. My thanks to a previous thread where the idea of viewing snapshots of websites to check their validity was mentioned; this posting is my payback to that contributor.
I used Tweak UI from Microsoft to change the default size of the thumbnails viewed in the directory to the maximum of 256 pixels wide. On my 1650-pixel-wide monitor I can see five per row, three rows high.
This is large enough to identify problem pages. I save the snapshots at 600 pixels wide, although I seldom look at them at full size, and I delete them once the run is completed.
I scan the thumbnail images of the websites and look for sites that have become MFA (made-for-AdSense) sites or search engine feeds, as well as other problems.
In this month's pass I identified about 50 problem sites using Xenu (out of 14,000), and another 10 using the script above. This saves a manual look at those sites. My father-in-law had been reviewing the 1,000 or so questionable sites every quarter, which took him about 8 hours; I did this in about 1 hour of my time.