Directories Forum
Directory Listing Hijack Detection
Are your directories full of domain parks?
incrediBILL
msg:3259593 - 12:30 am on Feb 22, 2007 (gmt 0)

The problem I've been running into for quite some time now is that thousands of sites, both old and new, succumb to the domain park crowd and worse. One day you have a link to "XYZ Plumbing" and the next day it's either a domain park site, porn redirect, or <shudders> a ringtone site.

I know this isn't just a problem I was having as I checked a few of my competitors and their sites were (and still are) filthy with bad links.

With tens of thousands of links in a directory, manually re-checking each link one at a time is out of the question.

The solution?

Write your own custom link checker on steroids.

We're talking about getting into the guts of the HTTP protocol here and looking at headers, because you need to be in the loop on every little detail, including intermediate redirect locations and much more, in order to effectively identify bad listings and truly clean up your directory.

Don't panic, as this isn't as complicated as it sounds, and you Linux people can easily use CURL to get a page complete with headers to examine. The biggest challenge is processing redirects one at a time as opposed to letting CURL follow them itself, which is an option. Sometimes you get useful text back with an intermediate redirect; if you let CURL follow the whole chain you get all the intermediate headers but only the final page content, so you miss those intermediate messages. However, when you run into ASPSESSION cookies, letting CURL follow redirects might save you a few hours of frustration and pulled-out hair.
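
For anyone who wants a concrete starting point, here is a minimal sketch, using only the Python standard library, of fetching a single hop without following redirects so the status, headers and any Location header can be inspected; the URL and user-agent name in it are placeholders, not real services. (With command-line curl the rough equivalent is curl -s -D - -o page.html URL, with no -L so redirects aren't followed.)

    # Minimal sketch: fetch one URL WITHOUT following redirects, so the status code,
    # every header and any intermediate Location can be inspected one hop at a time.
    # The URL and User-Agent below are placeholders, not real services.
    import http.client
    from urllib.parse import urlsplit

    def fetch_once(url, ua="ExampleLinkChecker/0.1"):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=15)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": ua})
        resp = conn.getresponse()
        body = resp.read(65536)             # the first 64 KB is plenty for fingerprinting
        headers = dict(resp.getheaders())
        conn.close()
        return resp.status, headers, body

    status, headers, body = fetch_once("http://www.example.com/")
    print(status, headers.get("Location"))  # a 301/302 here exposes the intermediate hop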

Some things to check for (there's a rough code sketch of these checks after the list):

1. Soft 404s or redirected 404s that ultimately return a "200 OK". Some servers still return the good old hard 404 error, but more often than not, especially with free hosting or servers managed with control panels, you don't get hard 404s. This means you have to build up a large list of fingerprints from what is returned in the page content, titles and HTTP headers (in redirects) to identify these soft 404s.

2. Redirects to landing pages, a key to identifying many domain parks. Domain parks tend to centralize the processing of all the domains they control and will ultimately redirect to the final destination after passing through some easily identifiable locations that are unique to those types of operations. Keep in mind that not all landing pages are permanent (or bad), as domain registrars will temporarily park sites being moved, so those should be put on hold.

3. Check the resulting text page for fingerprints that identify domain parks, scrapers, MFA pages, sites that are INFECTED with malware and more. This is the complicated part, as some things that initially look like obvious fingerprints turn out to generate false positives and flag good sites. Needless to say, you have to use some caution building this list and spot check sites that return a positive for each fingerprint before relying on it blindly.

4. Endless redirects; I stop at 10. Some sites get temporarily broken by people monkeying with their .htaccess files until they loop out of control. Maybe they will eventually fix the problem, but until they do, put them on hold. As a fail-safe, just to make sure my code isn't flawed, I pass sites that loop to CURL with the option to follow the redirects and see if CURL can actually get to a page. If CURL fails as well, the site is put on hold.

5. Timeouts: sites that flat-out don't respond but are still registered domains sitting on an actual server. I don't care why a site isn't responding; it's put on hold until they get it together and start serving pages.

6. Unregistered sites. If you get an error, check the WHOIS and see if the domain is no longer registered. That's a good clue you can dump it: it's not even in renewal, it's just gone.
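
To make the shape of this concrete, here is a rough Python sketch of how checks 1 through 6 might hang together. The fingerprint strings, the parking-host names and the classify/fetch_once helpers are all made up for illustration; they are not anyone's actual rules.

    # Rough sketch of checks 1-6; all fingerprint lists below are illustrative only.
    import socket

    MAX_REDIRECTS = 10                                         # check 4
    SOFT_404_FINGERPRINTS = ["page not found", "error 404"]    # check 1 (examples)
    PARK_REDIRECT_HOSTS = ["park.example", "landing.example"]  # check 2 (placeholders)
    PARK_PAGE_FINGERPRINTS = ["this domain is for sale", "domain parking"]  # check 3

    def classify(url, fetch_once):
        """fetch_once(url) -> (status, headers, body) WITHOUT following redirects."""
        hops = 0
        while hops < MAX_REDIRECTS:
            try:
                status, headers, body = fetch_once(url)
            except socket.timeout:
                return "hold: timeout"                         # check 5
            except OSError:
                return "hold: dns/connect error - check WHOIS" # check 6
            if status in (301, 302, 303, 307) and "Location" in headers:
                url = headers["Location"]     # real code would resolve relative URLs
                if any(h in url.lower() for h in PARK_REDIRECT_HOSTS):
                    return "hold: parked (redirect target)"    # check 2
                hops += 1
                continue
            text = body.decode("latin-1", "replace").lower()
            if status == 200 and any(f in text for f in SOFT_404_FINGERPRINTS):
                return "hold: soft 404"                        # check 1
            if any(f in text for f in PARK_PAGE_FINGERPRINTS):
                return "hold: park/MFA fingerprint"            # check 3
            return "ok" if status == 200 else "hold: http %d" % status
        return "hold: redirect loop"                           # check 4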

FWIW, if you have a fairly large directory you will build hundreds of fingerprints to flag bad sites in short order.

Now that you have a super duper link checker, what do you do with the sites you find that are broken?

Here's what I do with the sites that fail the link checker and are put on hold (a rough sketch of this bookkeeping follows the list):

A. Sites that fail the link checker are grouped by type of failure and quarantined for later review. These sites are spot checked to make sure the link checker isn't getting false positives.

B. Periodically re-scan all quarantined sites, every couple of weeks or monthly, and put sites that respond as active into a manual review queue.

C. Review sites flagged for re-inclusion manually before releasing them, to make sure the domain park or whatever didn't just change its fingerprint, which happens all the time, and to make sure sites flagged with malware have truly been cleaned up.

D. 90-day cut-off. Anything held in quarantine for more than 90 days that hasn't become active again, or hasn't had the malware removed, is unceremoniously dumped.
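
The hold/re-scan/purge cycle in A through D is really just a small amount of bookkeeping. Here's a hypothetical sketch of it; the SQLite table, the field names and the 90-day math are my own illustration, not a description of any particular directory script.

    # Hypothetical quarantine bookkeeping for steps A-D; the schema is illustrative.
    import sqlite3, time

    DAY = 86400
    db = sqlite3.connect("quarantine.db")
    db.execute("""CREATE TABLE IF NOT EXISTS quarantine (
        url TEXT PRIMARY KEY,
        failure TEXT,                 -- A: grouped by type of failure
        held_since INTEGER,
        needs_review INTEGER DEFAULT 0)""")

    def hold(url, failure):
        db.execute("INSERT OR IGNORE INTO quarantine VALUES (?, ?, ?, 0)",
                   (url, failure, int(time.time())))

    def rescan(check):                # B: re-run the link checker on quarantined sites
        rows = db.execute("SELECT url FROM quarantine WHERE needs_review = 0").fetchall()
        for (url,) in rows:
            if check(url) == "ok":    # C: a human still reviews before re-release
                db.execute("UPDATE quarantine SET needs_review = 1 WHERE url = ?", (url,))

    def purge():                      # D: dump anything held for more than 90 days
        db.execute("DELETE FROM quarantine WHERE needs_review = 0 AND held_since < ?",
                   (int(time.time()) - 90 * DAY,))
        db.commit()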

That's the basic idea of what I'm doing, mostly automated with some human review.

Building a custom link checker and fingerprinting pages may sound like a lot of work, but in the end it pays off massively. You can clean your directory automatically and effortlessly, over and over again, and in the process make sure your visitors have an excellent experience on your site.

The best part is your competitors will still serve up bad pages and as visitor frustration mounts they will abandon those sites and your traffic will increase.

 

elguiri
msg:3261361 - 1:01 pm on Feb 23, 2007 (gmt 0)

I'm mailing this to my programmer.

Birdman
msg:3261371 - 1:14 pm on Feb 23, 2007 (gmt 0)

Good post incrediBill! Just yesterday I manually checked all the recip links on a site and I found about 10% were parked! Luckily, it wasn't a large collection of links.

I wonder how all those bad links impact your SE ranks?

trinorthlighting
msg:3261406 - 1:43 pm on Feb 23, 2007 (gmt 0)

Bill,

That right there is the big reason not to submit to thousands of directories. It's not worth the chance of tanking a site. Honestly, you have to do a reality check and see how much traffic you are actually getting from the directory.

The only ones we have ever submitted to are the DMOZ and Yahoo directories. Beyond that we stay away from them, because you never know what the owner of the domain will do with it.

Not only do you have to worry about the hijack, you also have to worry about the directory owner blackhatting or spamming as well, which can definitely hurt sites that are linked from the directory.

I guess the lesson to be learned is to be very picky about the directories you submit to. Nowadays the links from directories are not necessary to obtain good SERPs.

jdMorgan
msg:3261432 - 2:12 pm on Feb 23, 2007 (gmt 0)

Be aware that using tools/library routines like CURL without changing the default User-agent name to something that gives a clue about who is doing the checking may get you banned from many sites. It's also a good idea to limit the fetch rate from any one given site for the same reason.

A page on your site providing information about your user-agent is a great idea, too.

For a User-agent, follow the major search engines' approach:

Mozilla/5.0 (compatible; DirectoryLinkChecker/1.0; +http://www.MyDirectory.com/linkchecker.html)
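
Putting that advice into code is a one-line request header plus a polite pause between hits to the same host. This is a generic Python illustration built around Jim's example string, not a description of anyone's actual checker.

    # Illustrative only: identify the checker in the User-Agent and pace requests per host.
    import time
    from urllib.parse import urlsplit
    from urllib.request import Request, urlopen

    UA = ("Mozilla/5.0 (compatible; DirectoryLinkChecker/1.0; "
          "+http://www.MyDirectory.com/linkchecker.html)")
    _last_hit = {}

    def polite_get(url, min_gap=5.0):
        host = urlsplit(url).netloc
        wait = _last_hit.get(host, 0) + min_gap - time.time()
        if wait > 0:
            time.sleep(wait)              # limit the fetch rate from any one site
        _last_hit[host] = time.time()
        req = Request(url, headers={"User-Agent": UA})
        return urlopen(req, timeout=15)   # note: urlopen follows redirects by default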

Jim

carguy84
msg:3261530 - 3:25 pm on Feb 23, 2007 (gmt 0)

So wait, people who land on a directory site click links other than the back button?

odd.

ogletree
msg:3261531 - 3:26 pm on Feb 23, 2007 (gmt 0)

The OP is only talking about reciprocal links. This has nothing to do with sites that link to you. Just submitting to 2 directories is irresponsible. Directory submissions are a part of any link strategy. Google even encourages submitting to directories. I have taken a site that had never had anything done to it, submitted it to 500 directories in 5 days, and it ranks #3 for a good local term. You people need to learn what is not black hat.

Not only does the OP encourage submitting to directories, he encourages reciprocal linking with some of them as long as they are still a directory.

blend27
msg:3261542 - 3:41 pm on Feb 23, 2007 (gmt 0)

incrediBILL, wouldn't the IP address of a link in question change if the domain becomes "parked"? Just a thought....

What are the chances that the domain/link in question changed hosting companies, or that the hosting company assigned/moved the domain to a different IP?....
I know some hosting companies will BLOCK/DENY PING, but the chances are very slim..

Hollywood
msg:3261578 - 3:59 pm on Feb 23, 2007 (gmt 0)

If someone has a working link checker like the one mentioned here, functioning now and with features as good as this, would you please sticky me? It is annoying trying to find stuff that works.

This sounds really good!

incrediBILL
msg:3261604 - 4:23 pm on Feb 23, 2007 (gmt 0)

The OP is only talking about reciprocal links.

No, the OP is talking about checking the links SUBMITTED to my directory. If you don't validate the listings in your directory they turn into garbage over time, sometimes in less than 30 days. I'm often surprised that a professional web site, one that someone obviously paid a lot of money to have developed, is lost in short order due to someone missing the domain renewal.

However, you could check reciprocals or any other big link list with the same technology to make sure those links are in good order as well. Links are Links!

Be aware that using tools/library routines like CURL without changing the default User-agent name to something that gives a clue about who is doing the checking may get you banned from many sites.

Jim is correct on this point, which I forgot to mention: the default CURL user agent is blocked in a few places, which will result in a false positive as being broken. Not as many sites as you might think block CURL, as I was only bounced out of a few, but I did change the user agent in order to check those links and avoid accidentally marking them as bad.

Remember, I'm not crawling an entire website submitted to my directory, I'm just checking the entry page, usually the index page.

incrediBILL
msg:3261644 - 4:40 pm on Feb 23, 2007 (gmt 0)

Ah yes, a couple more tips...

CACHE THE PAGES - If you CACHE all of the pages when you make a full pass validating your directory, you can then test new fingerprints rapidly. Using the cached pages on your local hard disk, instead of going back and loading each page from each individual web server, helps you rapidly test new code. However, I wouldn't rely on cached pages for more than 24 hours for this purpose.
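
As a rough illustration of the caching idea (the file layout, hashing and 24-hour cut-off below are my own assumptions, not a description of Bill's setup):

    # Illustrative page cache: key each URL by hash, expire after 24 hours.
    import hashlib, os, time

    CACHE_DIR = "page_cache"
    MAX_AGE = 24 * 3600            # don't trust cached pages for more than a day

    def cache_path(url):
        return os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())

    def get_cached(url):
        path = cache_path(url)
        if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE:
            with open(path, "rb") as f:
                return f.read()    # re-test new fingerprints against this, no refetch
        return None

    def put_cached(url, body):
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cache_path(url), "wb") as f:
            f.write(body)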

RUN A SEARCH SITE? What I'm describing isn't just useful for a directory, as search engines do similar things to detect broken sites in their index. A conversation I had with an engineer from Live Search confirmed that they do all the soft 404 checking as well. I was looking for some additional pointers on domain park detection and the guy suddenly got really tight-lipped; it appears they see that info as highly confidential, since they don't want the domain parks to figure out how they're being identified.

[edited by: incrediBILL at 4:42 pm (utc) on Feb. 23, 2007]

jdMorgan
msg:3261662 - 4:59 pm on Feb 23, 2007 (gmt 0)

> Remember, I'm not crawling an entire website submitted to my directory, I'm just checking the entry page, usually the index page.

This is one of the main reasons to properly identify your directory crawler. It is commonly understood that directories need to re-validate submitted links, and that since there is no actual crawling going on, a request for robots.txt will probably *not* be made. If the targeted site's Webmaster sees a link to a familiar directory in the user-agent string, or can follow the link and be reminded of submitting to that directory, then he/she is less likely to get trigger-happy and ban the user-agent.

Legitimate directory administrators and search engines alike will do well to keep in mind that there is an awful lot of abuse going on, and Web sites are increasingly running with "shields up" to protect themselves from all the scraping and harvesting these days. Using a meaningful, informative, and syntactically correct user-agent string is not only the polite thing to do, it's also a matter of keeping directory/search engine listings comprehensive by making sure the user-agent doesn't get shown to the door when fetching pages.

Jim

4specs
msg:3261742 - 5:39 pm on Feb 23, 2007 (gmt 0)

I maintain a library service directory with over 13,000 links. My goal is under 1% problem links. I do it this way:

1. Run Xenu at the beginning of the month
2. Run a Perl script on an outside server at the middle of the month that downloads the home page of each site. It logs any HTML refreshes and the page title. Every 6 months I have my father-in-law look at questionable or blank titles.
3. Every 18 months or so my father-in-law looks at each site
4. We have a phone call to each company every 24-36 months. Again my father-in-law. A great way to write him a check he feels good about.

Xenu turns up sites that are not responding, as well as 301s and 302s.

The Perl script turns up meta refreshes to new domains, NetSol expired domains, and other probable problems indicated by the page title.
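
The refresh-and-title part of such a script fits in a few lines. Here's a hypothetical Python equivalent of that pass; the regular expressions are generic examples, not 4specs' actual rules.

    # Hypothetical version of the "log refreshes and titles" pass; patterns are examples.
    import re

    META_REFRESH = re.compile(
        r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>]+)', re.I)
    TITLE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)

    def inspect(html):
        refresh = META_REFRESH.search(html)
        title = TITLE.search(html)
        return (refresh.group(1).strip() if refresh else None,
                title.group(1).strip() if title else "")   # blank titles get eyeballed

    target, title = inspect('<meta http-equiv="refresh" content="0;url=http://other.example/">')
    print(target, repr(title))    # http://other.example/  ''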

What this misses:

1. Dumb web designers doing a JavaScript refresh to a new domain, or just a link on the page - we find these with a manual check.
2. Companies no longer in existence where the ISP or web designer still has their website up.

incrediBILL
msg:3261755 - 5:45 pm on Feb 23, 2007 (gmt 0)

Can we borrow your father-in-law?

CainIV
msg:3261881 - 7:13 pm on Feb 23, 2007 (gmt 0)

Sounds great Bill. When are you marketing it, and what is the list price? :P

ogletree
msg:3262074 - 9:39 pm on Feb 23, 2007 (gmt 0)

Sorry about that Bill. The first post doesn't make it clear that you are talking about being a directory owner, especially if you don't read the subtitle. I was correct in that you were talking about outgoing links and not incoming. I was just confused about the fact that you have a directory.

carguy84
msg:3262623 - 1:41 pm on Feb 24, 2007 (gmt 0)

If it were me, I'd be keeping track of domain names, IP addresses and the date last visited. Any domain-based change is going to show up as an IP address change, so that flags any domains which suddenly show parked pages, changes of ownership, expirations... Then I'd programmatically compare the cached version of the homepage I saved to the one I just fetched, and if the differences reached a certain threshold, I'd kick off an alert email to myself. And if I were bored enough, I'd keep a running archive of cached pages to do even better comparison analysis on.
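
A hypothetical version of that comparison step, using nothing but the Python standard library; the 0.6 threshold is an arbitrary placeholder, not a tested value.

    # Hypothetical change detector: compare the cached homepage against a fresh fetch.
    import difflib

    def changed_too_much(old_html, new_html, threshold=0.6):
        # ratio() is 1.0 for identical pages and approaches 0.0 for totally different ones
        similarity = difflib.SequenceMatcher(None, old_html, new_html).ratio()
        return similarity < threshold     # past the threshold -> kick off an alert email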

Chip-

dataguy
msg:3266101 - 1:26 am on Feb 28, 2007 (gmt 0)

Cool that you brought this up now. My off-the-shelf crawler wasn't cutting it for checking listings in my directory, so a few weeks ago I created my own "fresh-bot" to try to keep my listings clean. It had been about 4 months since I was last able to do a complete crawl of the listings in my directory, and I'm finding that about 18% have gone bad during that time. I have nearly 700,000 listings, so 18% is substantial.

At the rate my new crawler is going, I should be able to get through all 700,000 URLs within 3 weeks, even running in single-threaded mode.

I don't cache the pages, but I do save the host IP, which is a good indicator if a site has gone through a major change. It also makes it easy to detect parked pages since many domain parking services load up their servers with tens of thousands of domains on the same IP.

After my first re-crawl is complete I plan on adding a function which just checks DNS records to see if any of the domains in the listings have expired or have been parked. I think I can check the DNS of all my listings in a few hours.
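
A rough idea of what that DNS pass might look like; the resolver calls and the same-IP grouping below are my guess at an implementation, not dataguy's code, and the 100-domain cut-off is arbitrary.

    # Rough sketch of a DNS pass over directory listings; the grouping is illustrative.
    import socket
    from collections import defaultdict

    def dns_check(hosts):
        by_ip = defaultdict(list)
        dead = []
        for host in hosts:
            try:
                by_ip[socket.gethostbyname(host)].append(host)
            except socket.gaierror:
                dead.append(host)     # no longer resolves: likely expired or dropped
        # many listings resolving to a single IP is a strong hint of a parking farm
        crowded = {ip: names for ip, names in by_ip.items() if len(names) > 100}
        return dead, crowded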

Thanks for sharing, Bill!
