Forum Moderators: Ocean10000 & incrediBILL

What's Crawling My Site?

     
7:15 pm on Jun 19, 2014 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Recently someone posted a simple question that we haven't addressed lately:

WHAT'S ALL THIS STUFF CRAWLING MY SITE?

Here's a list I came up with a while back that probably needs updating.
  • Intelligence gathering Spybots
    1. Copyright Compliance
    2. Branding Compliance
    3. Corporate Security Monitoring
    4. Media Monitoring (mp3, mpeg, etc.)
    5. Myriad of Safe-Site Monitoring solutions
    6. Government monitoring solutions
  • Content Scrapers (pure theft)
  • Data Aggregators
  • Link Checkers
  • Privacy Checkers
  • Web Copiers/Downloaders
  • Offline Web Browsers
  • Many open-source crawlers, e.g. Nutch and Heritrix

Before getting all paranoid: the government monitoring I mentioned is done by third parties that mine data and sell their reports to various agencies, which is completely legal without a warrant because the reports already exist. <wink> <wink> <nod> <nod>

Anyone got anything else to add to the list that I'm overlooking?

To highlight the problem, I have a domain that is basically a honeypot for bots, set up just to see what would hit it. I log everything I can detect that isn't human, which is pretty easy as no humans visit that site.
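For anyone who wants to try the same kind of tally on their own logs, here is a minimal sketch. It is not the setup behind the report; it assumes the server writes the Apache/nginx "combined" log format, and the function name `count_user_agents` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the last
# quoted field. Adjust the pattern if your server logs differently.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def count_user_agents(lines):
    """Tally requests per user-agent string, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical sample lines; the IPs are reserved documentation addresses.
sample = [
    '192.0.2.10 - - [19/Jun/2014:19:15:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Nutch-1.4"',
    '192.0.2.11 - - [19/Jun/2014:19:15:02 +0000] "GET /a HTTP/1.1" 200 734 "-" "Nutch-1.4"',
    '198.51.100.7 - - [19/Jun/2014:19:16:40 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

for agent, hits in count_user_agents(sample).most_common():
    print(hits, agent)
```

On a site with no human visitors, every agent in that tally is a bot of some kind, including the ones that claim to be browsers.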

Here's the latest update:
[incredibill.net...]

I have to link to the report as it's just too big to import into WebmasterWorld.

Note all the highlighted entries that look like browsers. They would slip past most user agent blacklisting, which is why data center blocking is the only way to stop that nonsense.
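As a rough illustration of the idea, data center blocking decides on the source IP range and ignores whatever the user agent claims. The sketch below is an assumption about how one might code it, not how the report was produced: the ranges are reserved documentation networks standing in for real hosting-provider blocks, and `is_datacenter_ip` and `should_block` are hypothetical names.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation networks). A real list would
# hold the CIDR blocks of hosting providers you have chosen to block.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any listed data center range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_RANGES)


def should_block(addr: str, user_agent: str) -> bool:
    """Block on source network alone; the user agent is deliberately ignored
    because a bot can send any browser string it likes."""
    return is_datacenter_ip(addr)


# A browser-looking agent coming from a listed range is still blocked.
print(should_block("192.0.2.55", "Mozilla/5.0 (Windows NT 6.1) Firefox/30.0"))
```

The trade-off is that a range list needs upkeep, and it does nothing about bots running on compromised home broadband connections.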

Hope this answers the question of what's crawling on your site and why.
7:16 pm on Jun 25, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Thanks for this information. I didn't know that there are so many different types of bots, and purposes for using them.

However, although I know that others here may not agree, in my opinion it's best to ignore most of these bots, and only take action when you notice one of them becoming obnoxious. Bandwidth is cheap, and trying to block everything that comes around consumes a lot of time. Unfortunately, for the past few months I've been having to deal with some kind of botnet, and I appreciate all of the help everyone here has given me. But I've really gotten tired of it, and would prefer to be spending my time on other things.
9:00 pm on Jun 25, 2014 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month





> Anyone got anything else to add to the list that I'm overlooking?


Search Engines
Ad Marketing
9:11 pm on Jun 25, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



What about email address harvesters? Are they already included in one of the mentioned types? One of my mistakes when I designed my first website was to include an email address on the pages. After that, I started putting email addresses in an image.
9:17 pm on Jun 25, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Also, what about the bots that hackers use to look for vulnerabilities?
7:24 pm on Jun 27, 2014 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



aristotle:
> Bandwidth is cheap, and trying to block everything ...

That's a way to guarantee your web site will be copied (if it's worth any traffic) and the copy placed higher in search engines than your own site.

I've seen complaints about "my site is higher for scrapers" in the google forum so often it's ludicrous. No one seems to take any notice when I say "prevent scrapers" so I gave up posting.

It has nothing at all to do with bandwidth and everything to do with security, from fending off scrapers to rejecting virus-injectors. And any of these are as likely to come from compromised broadband botnets as from servers, compromised or not.
7:39 pm on Jun 27, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



dstiles-
Thanks for the warning, but it's way too late in my case. Most of my articles have been scraped numerous times starting years ago. Maybe I'm lucky, but so far Google and Bing appear to know that my sites were the original source, and I've never seen any of my pages outranked by a scraper.

It's true that a lot of my images have been displaced by stolen copies in image searches, but for the types of sites that I have, image traffic doesn't matter to me.

Anyway, if you know of a totally full-proof way to prevent content from being scraped, let me know, and maybe I'll try to use it if I ever launch a new site.
9:01 pm on Jun 27, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



> if you know of a totally f[oo]l-proof way to prevent content from being scraped

"It is impossible to make anything foolproof, because fools are so damned ingenious."

The only 100% reliable way to keep your content from being scraped is not to put it online. Although having content nobody else is interested in comes a pretty close second.
9:40 pm on Jun 27, 2014 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



> No one seems to take any notice when I say "prevent scrapers" so I gave up posting.


People using the appliance blog software aren't technical. The problem is that there is a major technical gap between running a blog and doing server-side maintenance and installations.

The only way that'll change is if someone names a bot blocker product something like Magnum Condom for WordPress, keep your site from getting major league screwed.

It has to be free, or cheap as hell, and install as easy as any other plug-in.

If you can accomplish that, you MIGHT stop the whiners from complaining about scrapers outranking them.

Anything else is obviously too technical for them to deal with.
10:46 pm on Jun 27, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



> if you know of a totally f[oo]l-proof way to prevent content from being scraped

Thanks for correcting my wrong choice of word, Lucy. I was thinking "fool" in my mind, but somehow typed "full". Odd things like that happen sometimes.
12:59 am on Jun 28, 2014 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I like... if you know of a totally foo-proof way to prevent content from being scraped :)
1:14 am on Jun 28, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



> Odd things like that happen sometimes.

Several times a day, in my case, which may be why I notice it so readily when someone else does it :) ("When my fingers typed A, my brain of course meant B.")

But no, I don't think a fully foolproof solution exists...
5:51 am on Jun 28, 2014 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



> But no, I don't think a fully foolproof solution exists...


Not based on blacklisting, no ;)

But totally foolproof is impossible.

It's a war of escalation: the better we block, the better they evade blockage, and so on and so forth.
12:45 pm on Jun 28, 2014 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



You can block all the bots you want, but that won't stop some human who's searching around for content to add to their site or blog.
4:48 pm on Jun 28, 2014 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Until "cut and paste" and "browser cache" can be defeated/controlled, it's not just bots we have to worry about for scraping content. Most bot traffic can be controlled via whitelisting to some extent, but even then that is insufficient. As noted in the film "War Games", the only winning move is not to play, i.e. don't put up a website.
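To make the whitelisting idea concrete, here is a minimal sketch under assumptions: the allowed crawler tokens, the browser marker, and the function name `classify` are all illustrative, and a real setup would also verify that a claimed crawler really comes from the search engine's network.

```python
# Assumed allow list of crawler tokens; pick your own.
ALLOWED_CRAWLER_TOKENS = ("googlebot", "bingbot")

# Nearly every real browser sends a user agent starting with this marker.
BROWSER_MARKER = "mozilla/"


def classify(user_agent: str) -> str:
    """Return 'allow' for whitelisted crawlers, 'browser' for agents that
    look like browsers, and 'deny' for everything else."""
    ua = user_agent.lower()
    # Check crawler tokens first: whitelisted crawlers often also send
    # a Mozilla-style prefix.
    if any(token in ua for token in ALLOWED_CRAWLER_TOKENS):
        return "allow"
    if ua.startswith(BROWSER_MARKER):
        return "browser"
    return "deny"


for agent in (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1) Firefox/30.0",
    "Nutch-1.4",
):
    print(classify(agent), "<-", agent)
```

Anything that simply claims to be a browser lands in the 'browser' bucket, which is exactly why whitelisting by user agent alone is insufficient.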

As for iBill's list of what's crawling, does it include the "preview" and "snapshot" bots? What about FB's intrusive crawler (I suppose that falls in the link checkers category)?
 
