|Monitoring Directory Listings: Are Thumbnail ScreenShots A Panacea for Site Re-Review?|
What are other methods for determining if directory listings are "breaking bad"?
I'm sure many people running a directory have some form of link checker, some better than others, plus site quarantine procedures for domain redemption periods, assuming you can detect domains that currently aren't active.
However, at some point there is the need for a human to physically look at the web site to determine if it's still active or still being used for what it was when originally submitted to your directory.
When the automated tools all report "200 OK" and none of the analysis tools detect signs that the site is something other than what was originally submitted, what can you do to speed up reviewing tens of thousands of sites without visiting each and every one manually in a browser, which is massively painful and time-consuming?
Let a screen shot server visit all the sites for you and then you simply review those screen shots.
I have a home brew screen shot generator I've been using for a few years now to generate various screen shot sizes, down to thumbnails used for the site listings, and every now and then I browse all the larger versions of screen shots in groups of 500-1000 per page to see what the sites look like.
At a size of 200x200 you can often read things like "this site has moved" or see an obvious web hosting control panel page, or many other things that have slipped through the cracks of your automated link checking tools.
Sometimes the site was just in transition the last time you refreshed the screen shot and merely needs a new one; other times the site is toast and needs to be removed.
Imagine looking at 45 sites per screen on a 24" monitor, or 90 sites with the browser opened across 2 monitors, and you can see where this is going: it's a very speedy optical review method.
Using this method you can easily preview 40K sites before lunch!
If you have the technology, give it a try!
Just be sure your screenshot-er identifies itself properly, or you may be looking at a lot of 403-Access Denied pages... :(
I get no 403s Jim because I'm running screen shots off the grid, it's not in a data center :)
Besides, the tool I use runs MSIE direct and it offers no way to alter the UA or I would've done it already.
[edited by: incrediBILL at 7:48 am (utc) on Mar. 31, 2009]
Ahhh . . . automation and acceleration of tasks. Sounds a great deal like a winner, but only if it's a tool that others can gain access to.
Some of us lack the training to write code and others just don't have the mind for it. For us incompetents is there a Screenshooter Store, some place where we can go to buy screenshooters?
If not then what are the odds that a description of the tool, posted at the various rent-a-coder sites, will draw out a person actually capable of coding this utility? How would you describe the project? What would be a realistic charge for coding such a utility?
|For us incompetents is there a Screenshooter Store, some place where we can go to buy screenshooters? |
Well yes there are such tools available for sale, some free even, as I certainly didn't write the screen shot tool itself.
But if I told you what commercial tool it was that I was using it would get the old TOS edit ;)
The only thing you may need a coder for, depending on your directory software, is to create the export file of URIs that the screen shot generator reads.
Worst case, you can browse the resulting screen shots in Windows Explorer in extra large icon view, but you won't see nearly as many images per page as in my jam-packed custom HTML view. Possibly an off-the-shelf image viewing package designed for reviewing masses of images would work here, but I haven't looked for any for this particular job.
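For anyone without a custom viewer, a jam-packed HTML view is not hard to generate. The following is only a minimal sketch, not the poster's actual viewer: the function names, the `review-0001.html` file naming, and the (site URL, image path) input format are all assumptions made for illustration.

```python
import html
from pathlib import Path


def build_contact_sheets(shots, per_page=500, size=200):
    """Return a list of HTML page strings, each showing up to per_page thumbnails.

    shots is a list of (site_url, image_path) pairs (an assumed input format).
    """
    pages = []
    for start in range(0, len(shots), per_page):
        cells = []
        for site_url, image_path in shots[start:start + per_page]:
            url = html.escape(site_url, quote=True)
            src = html.escape(image_path, quote=True)
            # Each thumbnail links to the live site so a suspicious one is a click away.
            cells.append(
                f'<a href="{url}" title="{url}">'
                f'<img src="{src}" width="{size}" height="{size}" alt="{url}"></a>'
            )
        pages.append(
            '<!DOCTYPE html>\n<html><body style="margin:0">\n'
            + "\n".join(cells)
            + "\n</body></html>\n"
        )
    return pages


def write_contact_sheets(shots, out_dir, per_page=500):
    """Write the pages to out_dir as review-0001.html, review-0002.html, and so on."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(build_contact_sheets(shots, per_page), start=1):
        path = out / f"review-{number:04d}.html"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths
```

Open the generated pages in a browser and scroll; the images wrap to fill whatever width the window has, so wider monitors show more sites per screen.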
The screen shot tool simply takes a text file list of URIs as input and spits out a bunch of image files in a directory.
FWIW, I run a simple scheduled task every 5 minutes that downloads the latest additions as a text file, runs the screen shot tool, then uploads the resulting files to the server's thumbnail directory, all day long, and it was simple as heck to set up.
Here's a simplified sample of the .bat file I run on a scheduler to show you how simple it was to create a screen shot server.
|REM download list of URIs |
c:\curl\curl.exe "http://www.example.com/get-url-list.php" > url-list.txt
REM generate screen shots
"C:\Program Files\examplescreenshottool\examplescreenshottool.exe" /in url-list.txt
REM upload to server directory
There are a few support files behind this, like the XML file that defines the size of the screen shots being made and the FTP commands being sent, but it's all simple stuff that any webmaster could do.
The "get-url-list.php" is the only coding possibly required on the server side if you don't already have some exporting functions available.
|it would get the old TOS edit |
:) C'mon. Though I sometimes feel a bit like a bridge troll I've been known to loosen the rules, usually to help craft a thread that is likely to be "useful to many".
I'm game for people discussing particulars of how they handle screenshots, including software in this case.
I'm game for people discussing all methods of handling the issue of monitoring directory style listings to assure their freshness and accuracy. If there are commercial solutions we can address them . . once in a great while. ;)
How do you all expedite the task of keeping out the bad stuff - subsequent transformations of listed websites, websites that have died, sites going from a website to a parked domain page or worse - transformed on expiration to malware download, MFA, pron, etc.?
Manual checking? Argh! Screenshots is a nice trick. Member "reporting" links? Do they work? Outsourcing the grunt work?
[edited by: Webwork at 7:21 pm (utc) on Mar. 31, 2009]
|How do you all expedite the task of keeping out the bad stuff - subsequent transformations of listed websites, websites that have died, sites going from a website to a parked domain page or worse - transformed on expiration to malware download, MFA, pron, etc.? |
My own custom script.
I use the same "curl" program in the script above to download each main web page to a file, with verbose headers and all redirections.
Then I parse it via many hundreds of "fingerprints" I've collected over time to identify sites with a virus, domain park, hosting setup, soft 404s, adult sites, hacked sites and a whole lot more.
Additionally, there are calls made to WHOIS to identify various name parks and to check to see if the domain is still registered if it fails to load.
Even that wasn't terribly complicated but the collection of data to make that process work well can take years.
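To give a feel for what fingerprint matching can look like, here is a minimal sketch. It is not the poster's script: the four patterns and their labels are hypothetical stand-ins for a real collection that, as noted above, takes years to build.

```python
import re

# Hypothetical fingerprints for illustration only; a production list would be
# far larger and tuned against real pages.
FINGERPRINTS = [
    ("parked", re.compile(r"this domain (is|may be) for sale|domain (is )?parked", re.I)),
    ("hosting-setup", re.compile(r"default web ?site page|account has been suspended", re.I)),
    ("moved", re.compile(r"this site has moved", re.I)),
    ("soft-404", re.compile(r"page (was )?not found|no longer available", re.I)),
]


def classify(page_text):
    """Return the label of every fingerprint that matches the downloaded page."""
    return [label for label, pattern in FINGERPRINTS if pattern.search(page_text)]


if __name__ == "__main__":
    print(classify("<h1>This domain is for sale!</h1>"))
    print(classify("Welcome to Bob's Guitars"))
```

A page that matches nothing is not proven good; it just hasn't tripped any known pattern, which is why the visual review is still worth doing.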
|Member "reporting" links? Do they work? |
We get lots of member reports. It doesn't work as well as it should, but it's better than no member reports.
Now let's say you save the screen shots as a small GIF. Would it not be possible to automate the comparison, detecting if the content of the page has changed significantly since the last crawl? Like a visual checksum. If the page hasn't changed, then there's no reason to review it. Next!
I spent a few years building a system which sounds like it does exactly what you're describing and for the same purpose. It allows a one man operation to manually check tens of thousands of web sites a day. Actually I used to introduce myself to people by saying that I surfed tens of thousands of web sites per day (technically not completely accurate) but it did raise a lot of eyebrows.
My system is set to show 20 thumbnails at a time, along with text and category information (including geo location). If something didn't look correct, I could just check or uncheck the proper box and hit "Next" for the next 20 sites to review. Eventually I added Bayesian filtering to make sure the text on the web page matched the directory category that the owner of the web site put it in. This sped things up even more.
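The Bayesian check described above can be approximated with a small naive Bayes classifier. This is a sketch of the general technique under simple assumptions (word counts with add-one smoothing, a crude tokenizer), not a reconstruction of the system that was sold; the class and method names are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


class CategoryFilter:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training pages seen
        self.vocabulary = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def score(self, category, text):
        """Log-probability of a trained category given the text, up to a constant."""
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[category] / total_docs)
        counts = self.word_counts[category]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for word in tokenize(text):
            log_prob += math.log((counts[word] + 1) / denominator)
        return log_prob

    def best_category(self, text):
        return max(self.doc_counts, key=lambda c: self.score(c, text))

    def matches(self, claimed_category, text):
        """True when the page text looks most like the category it was filed under."""
        return self.best_category(text) == claimed_category
```

Train it on the text of listings already known to be good, then flag any site where `matches()` returns False for the category the owner chose, so the human reviewer looks at those first.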
We tried to promise that new directory submissions would be reviewed within 2 hours, which led to hiring some employees to "hit the button" around the globe. We called this "working the hatch" in homage to the TV series Lost, where they needed to enter a number into a computer every 108 minutes or the world would come to an end.
I sold this system about a year and a half ago for a nice chunk of change. Sometimes I miss it, most of the time I'm glad it's no longer my responsibility.
There's a better way, it's called Google pagerank.
Run the broken link checker of course but to weed out the bad let Google pagerank do the work.
It's easy: include a feature that checks each site's PageRank and displays it. Next, sort the sites by PageRank. The crap sites will invariably end up at the bottom.
If a PR6 site suddenly goes gray bar because the owner starts selling #*$! it will drop to the bottom of your list of sites and you can clear those out easily.
|Now let's say you save the screen shots as a small GIF. Would it not be possible to automate the comparison, detecting if the content of the page has changed significantly since the last crawl? Like a visual checksum. If the page hasn't changed, then there's no reason to review it. Next! |
I cache the text downloaded from the site, for various purposes, but I can compare raw HTML from old to new. However, if your last text download was of a bad page, you won't be alarmed the new one is bad either. As long as your baseline sample is good this method rocks.
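A rough sketch of comparing a cached copy against a fresh download, using Python's standard `difflib`. The tag stripping is deliberately crude and the 0.5 threshold is an arbitrary assumption that would need tuning on real pages; none of this is taken from the poster's own code.

```python
import difflib
import re


def visible_text(raw_html):
    """Crudely strip scripts, styles, and tags, then collapse whitespace."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return " ".join(text.split())


def similarity(old_html, new_html):
    """Return a 0.0 to 1.0 ratio of how alike the word sequences of two pages are."""
    old_words = visible_text(old_html).split()
    new_words = visible_text(new_html).split()
    return difflib.SequenceMatcher(None, old_words, new_words).ratio()


def needs_review(old_html, new_html, threshold=0.5):
    """Flag a site when the new page shares little text with the cached baseline."""
    return similarity(old_html, new_html) < threshold
```

As noted above, this is only as good as the baseline: if the cached copy was already a parked page, a new parked page will score as "unchanged".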
However, nothing beats a raw look at what the heck is in your index every now and then.
Sometimes you find you've been gamed when you see the same web page 10x with different domains and different emails trying to cover their tracks.
|There's a better way, it's called Google pagerank. |
Google Pagerank is an interesting idea but it won't work for most in my niche. Many of them are brand new sites trying to get noticed, usually top quality with no PR. Most are the best flash sites you've ever seen which also don't rank well. Unfortunately, often the title will say "Index" or "Home", very sad but an opportunity for me! If they had PR they wouldn't need my help in the first place.
|screen shots as a small GIF. Would it not be possible to automate the comparison, detecting if the content of the page has changed significantly since the last crawl? |
This would work except for rotating or changing content. From a pure "how many pixels have changed" viewpoint, a blog home page changes dramatically when a few new entries have been posted. A large ad block would have completely new pixels on each page load. And, even just the entire page shifting down by one pixel would cause a pixel for pixel comparison to be completely different.
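The one-pixel shift problem is easy to demonstrate on a toy image. The sketch below uses a made-up 8x8 grayscale grid of alternating dark and light rows (standing in for lines of text) rather than a real screen shot, and the helper names are invented for the example.

```python
def changed_fraction(old, new):
    """Fraction of pixel positions whose values differ between two equal-size grids."""
    total = len(old) * len(old[0])
    changed = sum(
        1
        for old_row, new_row in zip(old, new)
        for a, b in zip(old_row, new_row)
        if a != b
    )
    return changed / total


def shift_down(image, rows=1, fill=255):
    """Return a copy of image moved down by rows (must be >= 1), padding the top with fill."""
    width = len(image[0])
    return [[fill] * width for _ in range(rows)] + [row[:] for row in image[:-rows]]


# Toy 8x8 "page": even rows dark (0), odd rows light (255).
page = [[0 if y % 2 == 0 else 255 for _ in range(8)] for y in range(8)]
shifted = shift_down(page)

print(changed_fraction(page, page))     # identical images: 0.0
print(changed_fraction(page, shifted))  # same content, one row lower: 1.0
```

Nothing about the content changed, yet every pixel position differs, which is why a naive pixel-for-pixel checksum flags far too many pages.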
The other thing that I have found is the abandoned sites that don't get taken down. Maybe the owner doesn't notice the auto-renewal on his credit card, or maybe it is free hosting linked to an ISP contract.
It is a particular issue in the music business: the site exists and checks out technically, but the gig list is for 2007.