
Playing Around With Bad Bots

     
1:57 pm on Jun 13, 2012 (gmt 0)
bakedjake (Administrator from CA)


I used to give this counter-intelligence presentation at a few of the conferences.

I talked about the idea of giving your competitors false information - for example, when you know your competitors are visiting and/or scraping your site, perhaps you adjust the prices for them.
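The mechanics of that are trivial. Purely as a hypothetical sketch (the address range and the discount here are invented), it comes down to a conditional on who is asking:

    import ipaddress

    # Hypothetical: the competitor's network range and the markdown are made up.
    COMPETITOR_NETS = [ipaddress.ip_network("203.0.113.0/24")]

    def price_to_show(remote_ip, real_price):
        ip = ipaddress.ip_address(remote_ip)
        if any(ip in net for net in COMPETITOR_NETS):
            # a known competitor is looking: show them something misleading
            return round(real_price * 0.8, 2)
        return real_price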

Sometimes the idea isn't to outright ban bad bots completely.

Often, bad bots are scrapers designed to republish your content.

Many bad bots and republishers aren't all that sophisticated. They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site - not the same content that appears on your site.
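Just to illustrate how little it takes to slip past that kind of filter (the markup and URL are invented for the example), an anchor whose href isn't the first attribute sails straight through an exact match on "<a href=":

    # Illustration only: a scraper matching the literal string '<a href='
    # never sees this link, because another attribute comes before href.
    def stealth_link(url, anchor_text):
        return '<a class="cite" href="%s">%s</a>' % (url, anchor_text)

    print(stealth_link("http://www.example.com/", "source"))
    # <a class="cite" href="http://www.example.com/">source</a>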

Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.

It's just a theory.
8:09 pm on June 13, 2012 (gmt 0)
incredibill (Administrator from US)


Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.


I've ROTFLMAO and I can't get up.

But you are correct, the scrapers can be messed with in all sorts of fun ways.

I put data tracking bugs in the content so I can track where the scrapers post their content and connect the dots of scraper and destination. This can be slightly challenging with some that scramble the output or blend it with other content.
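The mechanics are roughly this (a sketch rather than my actual code; the token format and log file name are just for illustration):

    import logging
    import uuid

    logging.basicConfig(filename="trackbugs.log", level=logging.INFO)

    def tag_content(html, remote_ip, user_agent):
        # an innocuous, searchable token unique to this request
        token = "zq" + uuid.uuid4().hex[:10]
        logging.info("%s %s %s", token, remote_ip, user_agent)
        # tuck it where a lazy scraper will carry it along verbatim
        return html.replace("</body>", '<span id="%s"></span></body>' % token, 1)

A web search for the token later ties the scraper's destination back to the request that received it.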

Another interesting thing you can do if you want to trash a scraper is just spew out pages of profanity that will stop their output from being displayed to anyone with SafeSearch enabled.
8:16 pm on June 13, 2012 (gmt 0)
g1smd (Senior Member)


There's loads of things that can be done.

You can return random or doctored content; maybe minor changes that can be tracked, or completely different content that they can republish all they want because it's garbage. You can also play around with the actual status codes that are returned. It's not just 403 that can stop bots.
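For example (just a sketch; the particular codes are arbitrary), 410, 429 and 503 each discourage a well-behaved client in a different way:

    import random

    def bad_bot_reply():
        # rotate through statuses that a compliant crawler takes seriously
        return random.choice([
            ("410 Gone", []),
            ("429 Too Many Requests", [("Retry-After", "86400")]),
            ("503 Service Unavailable", [("Retry-After", "604800")]),
        ])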
8:41 pm on June 13, 2012 (gmt 0)
dstiles (Senior Member from GB)


Do you think an empty page with a 200 would annoy them? :)

Have to say, I like the idea of sending false data. Shame it's one more thing I don't have time to do. :(
9:05 pm on June 13, 2012 (gmt 0)
bakedjake (Administrator from CA)


Do you think an empty page with a 200 would annoy them? :)


There's many ways to annoy them. My favorite is tarpitting.
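A tarpit doesn't need to be elaborate. A bare sketch, assuming the bot has already been identified and its connection handed off:

    import socket
    import time

    def tarpit(conn: socket.socket, duration=600, delay=10):
        # drip harmless bytes so the connection stays open for ages
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        deadline = time.time() + duration
        try:
            while time.time() < deadline:
                conn.sendall(b"<!-- -->")
                time.sleep(delay)
        finally:
            conn.close()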

But I'm thinking more about taking advantage of their republishing of content, rather than just annoying them.
9:40 pm on June 13, 2012 (gmt 0)
lucy24 (Senior Member from US)


They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

Early this year, I met a robot so gloriously stupid, it tried to follow anything in the form <a \w+ = "{blahblah}". Clearly it never entered its robotic mind that anchors could go "class, name, id, href" in that order. So it spent a lot of time looking for nonexistent files like "/directory/outside" (anchor class) or "/directory/tag_1", "/directory/footnote_2", "/directory/pic_singiqtanga" (anchor names), or... Drat, I've forgotten the third.

:: detour to look ::

Fragments, those are the best of all! While normal humans dutifully go to "#footnote_1", the robot goes in search of "/directory/footnote_1". And with the aid of a simple rewrite, you could really send it there.

Gosh, this page looks just like /directory/outside and /directory/tag_1 and, and, and ... Maybe if I scrape the surface I'll find the difference.
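(For the curious: the "simple rewrite" was server configuration in my case, but the logic amounts to no more than this; the path pattern and decoy file are made up for the example.)

    import re

    # catch the fragment names the robot mistook for paths
    FRAGMENT_PATH = re.compile(r"^/directory/(footnote|tag|pic)_\w+$")

    def route(path):
        if FRAGMENT_PATH.match(path):
            with open("decoy.html", "rb") as f:
                return "200 OK", f.read()   # every fake path gets the same decoy
        return None                         # fall through to normal handling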
10:40 pm on June 13, 2012 (gmt 0)
Junior Member


Very apropos discussion. Over the past few days I have been in a discussion with the owner of a major crawler, and part of one of my several-page responses to him last night, on giving scrapers bad content, was:


Lots of possibilities.

Sending them on wild goose chases around the world and making sure that their database/link connections (their data for profit) get polluted and skewed with wild, imaginary “SEO connections” and other junk that the real internet's link structure never intended.

The trapped ones also end up in tar-pits that slow them down to pure randomized slow motion so I can watch them in peace doing their dance. Only “good bots” survive visits to those sites. I am just saying…

Some of the bad crawlers can be kept busy “scraping” for several days at a time, with not one stitch of valid information ending up in their databases for all that work.


But bakedjake, that is not skewing search engine results when the trapper sites are never seen by a real search engine, only by bad scrapers ignoring even basic instructions. It is merely a skew of the data for all the other fools that want to make a business out of scraping information and content from others, rather than doing something on their own. :)
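The goose-chase part is nothing exotic either; roughly this shape (everything here is invented for illustration): trap URLs that generate junk text and link only to more trap URLs.

    import hashlib
    import random

    WORDS = ["lorem", "ipsum", "widget", "gizmo", "synergy", "paradigm"]

    def maze_page(path):
        # the same path always yields the same junk, so it looks like a stable page
        seed = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        rng = random.Random(seed)
        text = " ".join(rng.choice(WORDS) for _ in range(300))
        links = " ".join('<a href="/maze/%d.html">more</a>' % rng.randrange(10 ** 6)
                         for _ in range(10))
        return "<html><body><p>%s</p><p>%s</p></body></html>" % (text, links)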
12:59 pm on June 15, 2012 (gmt 0)
Senior Member


On smaller sites, once they get into a trap, I usually have a routine that reads a 1 MB image file and spits out binary data with a few lines replaced by specific words that I can track later to see where it ends up.

Bandwidth is cheap, so for the first several requests they would get up to 15 MB of very readable stuff.

Then they're blanked by 403s.
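Roughly like this (a sketch; the file name and marker scheme are placeholders):

    def junk_payload(marker, source="big_image.jpg", size=1024 * 1024):
        # binary noise from a real file, stamped with trackable marker strings
        with open(source, "rb") as f:
            data = bytearray(f.read(size))
        stamp = marker.encode("ascii")
        for offset in range(0, len(data) - len(stamp), 100 * 1024):
            data[offset:offset + len(stamp)] = stamp
        return bytes(data)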

FUN!
5:30 pm on June 15, 2012 (gmt 0)
Senior Member


A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site - not the same content that appears on your site.

Not in the post-Penguin world; you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.
7:24 pm on June 15, 2012 (gmt 0)
wilderness (Senior Member)


Not in the post-Penguin world; you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.


Only if you're not aware of the difference between Google and Joe Schmo's bean-counter bot.
8:00 pm on June 15, 2012 (gmt 0)
incredibill (Administrator from US)


The problem you have screwing around with bad bots is that if the bad bots actually publish your foolishness and the good bots then index that foolishness, it can all come back to bite you in the butt.

A quick for-instance was a bright idea I had of tagging all links on a page with a code that IDs the source of the original page request. Fun for tracking humans using a TOR proxy, rotating IP pools like AOL, bots that crawl from multiple IPs, or simply where the data lands. Unfortunately, if you put it in the path you need to block all those paths in robots.txt, which Google WMT will bitch about being kept away from, and if you use a parameter instead then the SEs think it's a new page.
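The tagging itself is the easy part. A sketch (the parameter name and path scheme are invented) showing the two placements and their respective headaches:

    from urllib.parse import urlencode

    def tag_link(url, request_id, as_parameter=True):
        if as_parameter:
            # easy, but search engines treat every token as a brand-new page
            sep = "&" if "?" in url else "?"
            return url + sep + urlencode({"src": request_id})
        # path form: every such path then has to be blocked in robots.txt
        return url.rstrip("/") + "/t-" + request_id + "/"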

Just plan carefully is all I'm saying, or your little bit of fun could result in a trip to the burn unit.