
Search Engine Spider and User Agent Identification Forum

    
What's Crawling My Site?
incrediBILL
msg:4681275
7:15 pm on Jun 19, 2014 (gmt 0)

Recently someone posted a simple question that we haven't addressed lately:

WHAT'S ALL THIS STUFF CRAWLING MY SITE?

Here's a list I came up with a while back that probably needs updating.
  • Intelligence gathering Spybots
    1. Copyright Compliance
    2. Branding Compliance
    3. Corporate Security Monitoring
    4. Media Monitoring (mp3, mpeg, etc.)
    5. Myriad of Safe-Site Monitoring solutions
    6. Government monitoring solutions
  • Content Scrapers (pure theft)
  • Data Aggregators
  • Link Checkers
  • Privacy Checkers
  • Web Copiers/Downloaders
  • Offline Web Browsers
  • Many open-source crawlers, e.g. Nutch and Heritrix

Before anyone gets all paranoid: the government monitoring I mentioned is done by 3rd parties that mine data and sell their reports to various agencies, which is completely legal without a warrant because the reports already exist. <wink> <wink> <nod> <nod>

Anyone got anything else to add to the list that I'm overlooking?

To highlight the problem, I have a domain that is basically a honeypot for bots, just to see what would hit it; I log everything I can detect isn't human, which is pretty easy since no humans visit that site.

Here's the latest update:
[incredibill.net...]

I have to link to the report as it's just too big to import into WebmasterWorld.

Note all the highlighted entries that look like browsers; they would slip past most user-agent blacklisting, which is why data center blocking is the only way to stop that nonsense.
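
For anyone who wants to see what data center blocking boils down to, here's a minimal Python sketch. It assumes you can get at the client IP somewhere in your stack, and the CIDR ranges are just RFC 5737 documentation placeholders, not a real hosting/data-center list:

import ipaddress

# Placeholder ranges (RFC 5737 documentation prefixes); a real list would come
# from published data-center / hosting provider ranges.
DATA_CENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_data_center_ip(client_ip):
    """Return True if the visitor's IP falls inside a known data-center range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in DATA_CENTER_RANGES)

# Quick check: the first IP falls in a listed range, the second does not.
for ip in ("203.0.113.45", "93.184.216.34"):
    print(ip, "block" if is_data_center_ip(ip) else "allow")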

Hope this answers the question of what's crawling your site and why.

 

aristotle
msg:4682832
7:16 pm on Jun 25, 2014 (gmt 0)

Thanks for this information. I didn't know that there are so many different types of bots, and purposes for using them.

However, although I know that others here may not agree, in my opinion it's best to ignore most of these bots and only take action when you notice one of them becoming obnoxious. Bandwidth is cheap, and trying to block everything that comes around consumes a lot of time. Unfortunately, for the past few months I've been having to deal with some kind of botnet, and I appreciate all the help everyone here has given me. But I've really gotten tired of it and would prefer to be spending my time on other things.
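
If it helps anyone, a rough way to spot the "obnoxious" ones is just to count requests per IP in the raw access log. A quick Python sketch; the log path and the threshold are arbitrary examples, not recommendations:

from collections import Counter

LOG_FILE = "access.log"   # hypothetical path to a common/combined-format log
THRESHOLD = 1000          # requests per log window I'd call "obnoxious"

hits = Counter()
with open(LOG_FILE) as fh:
    for line in fh:
        ip = line.split(" ", 1)[0]   # client IP is the first field
        hits[ip] += 1

for ip, count in hits.most_common():
    if count < THRESHOLD:
        break
    print(f"{ip} made {count} requests - worth a closer look")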

keyplyr
msg:4682859
9:00 pm on Jun 25, 2014 (gmt 0)

Anyone got anything else to add to the list that I'm overlooking?


Search Engines
Ad Marketing

aristotle
msg:4682863
9:11 pm on Jun 25, 2014 (gmt 0)

What about email address harvesters? Are they already included in one of the mentioned types? One of my mistakes when I designed my first website was to include an email address on the pages. After that, I started putting email addresses in an image.
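
Besides the image trick, another common dodge is entity-encoding the address so harvesters that grep for plain-text addresses miss it; determined scrapers can still decode it, so it's mitigation, not prevention. A tiny Python sketch with a placeholder address:

def entity_encode(text):
    """Encode every character as an HTML decimal entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)

address = "user@example.com"   # placeholder address
print(f'<a href="mailto:{entity_encode(address)}">{entity_encode(address)}</a>')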

aristotle
msg:4682865
9:17 pm on Jun 25, 2014 (gmt 0)

Also, what about the bots that hackers use to look for vulnerabilities?

dstiles
msg:4683421
7:24 pm on Jun 27, 2014 (gmt 0)

aristotle:
> Bandwidth is cheap, and trying to block everything ...

That's a way to guarantee your web site will be copied (if it's worth any traffic) and the copy placed higher in search engines than your own site.

I've seen the complaint "scrapers rank higher than my site" in the Google forum so often it's ludicrous. No one seems to take any notice when I say "prevent scrapers", so I gave up posting.

It has nothing at all to do with bandwidth and everything to do with security, from fending off scrapers to rejecting virus-injectors. And any of these are as likely to come from compromised broadband botnets as from servers, compromised or not.

aristotle
msg:4683426
7:39 pm on Jun 27, 2014 (gmt 0)

dstiles-
Thanks for the warning, but it's way too late in my case. Most of my articles have been scraped numerous times starting years ago. Maybe I'm lucky, but so far Google and Bing appear to know that my sites were the original source, and I've never seen any of my pages outranked by a scraper.

It's true that a lot of my images have been displaced by stolen copies in image searches, but for the types of sites that I have, image traffic doesn't matter to me.

Anyway, if you know of a totally full-proof way to prevent content from being scraped, let me know, and maybe I'll try to use it if I ever launch a new site.

lucy24
msg:4683462
9:01 pm on Jun 27, 2014 (gmt 0)

if you know of a totally f[oo]l-proof way to prevent content from being scraped

"It is impossible to make anything foolproof, because fools are so damned ingenious."

The only 100% reliable way to keep your content from being scraped is not to put it online. Although having content nobody else is interested in comes a pretty close second.

incrediBILL
msg:4683479
9:40 pm on Jun 27, 2014 (gmt 0)

No one seems to take any notice when I say "prevent scrapers" so I gave up posting.


People using the appliance blog software aren't technical. The problem is there's a major technical gap between running a blog and doing server-side maintenance and installations.

The only way that'll change is if someone names a bot blocker product something like Magnum Condom for WordPress: keep your site from getting major-league screwed.

It has to be free, or cheap as hell, and install as easily as any other plug-in.

If you can accomplish that, you MIGHT stop the whiners from complaining about scrapers outranking them.

Anything else is obviously too technical for them to deal with.

aristotle
msg:4683490
10:46 pm on Jun 27, 2014 (gmt 0)

if you know of a totally f[oo]l-proof way to prevent content from being scraped

Thanks for correcting my wrong choice of word, Lucy. I was thinking "fool" in my mind, but somehow typed "full". Odd things like that happen sometimes.

keyplyr
msg:4683496
12:59 am on Jun 28, 2014 (gmt 0)

I like... if you know of a totally foo-proof way to prevent content from being scraped :)

lucy24
msg:4683498
1:14 am on Jun 28, 2014 (gmt 0)

Odd things like that happen sometimes.

Several times a day, in my case, which may be why I notice it so readily when someone else does it :) ("When my fingers typed A, my brain of course meant B.")

But no, I don't think a fully foolproof solution exists...

incrediBILL
msg:4683532
5:51 am on Jun 28, 2014 (gmt 0)

But no, I don't think a fully foolproof solution exists...


Not based on blacklisting, no ;)

But totally foolproof is impossible.

It's a war of escalation: the better we block, the better they evade the blocking, and so on.

aristotle
msg:4683586
12:45 pm on Jun 28, 2014 (gmt 0)

You can block all the bots you want, but that won't stop some human who's searching around for content to add to their site or blog.

tangor
msg:4683619
4:48 pm on Jun 28, 2014 (gmt 0)

Until "cut and paste" and "browser cache" can be defeated/controlled it's not just bots we have to worry about for scraping content. Most bot traffic can be controlled via whitelisting to some extent, but even then that is insufficient. As noted in the film "War Games" the only winning move is not to play, ie. don't put up a website.

As for iBill's list of what's crawling, does it include the "preview" and "snapshot" bots? What about FB's intrusive bot (I suppose that falls under the link checkers category)?
