
Playing Around With Bad Bots

     
1:57 pm on Jun 13, 2012 (gmt 0)
bakedjake (Administrator from CA)


I used to give this counter-intelligence presentation at a few of the conferences.

I talked about the idea of giving your competitors false information - for example, when you know your competitors are visiting and/or scraping your site, perhaps you adjust the prices for them.
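The mechanics of that are trivial. Purely as a hypothetical sketch (the address range and the discount here are invented), it comes down to a conditional on who is asking:

    import ipaddress

    # Hypothetical: the competitor's network range and the markdown are made up.
    COMPETITOR_NETS = [ipaddress.ip_network("203.0.113.0/24")]

    def price_to_show(remote_ip, real_price):
        ip = ipaddress.ip_address(remote_ip)
        if any(ip in net for net in COMPETITOR_NETS):
            # a known competitor is looking: show them something misleading
            return round(real_price * 0.8, 2)
        return real_price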

Sometimes the idea isn't to outright ban bad bots completely.

Often, bad bots are scrapers designed to republish your content.

Many bad bots and republishers aren't all that sophisticated. They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site - not the same content that appears on your site.
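Just to illustrate how little it takes to slip past that kind of filter (the markup and URL are invented for the example), an anchor whose href isn't the first attribute sails straight through an exact match on "<a href=":

    # Illustration only: a scraper matching the literal string '<a href='
    # never sees this link, because another attribute comes before href.
    def stealth_link(url, anchor_text):
        return '<a class="cite" href="%s">%s</a>' % (url, anchor_text)

    print(stealth_link("http://www.example.com/", "source"))
    # <a class="cite" href="http://www.example.com/">source</a>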

Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.

It's just a theory.
8:09 pm on June 13, 2012 (gmt 0)
incredibill (Administrator from US)


Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.


I've ROTFLMAO and I can't get up.

But you are correct, the scrapers can be messed with in all sorts of fun ways.

I put data tracking bugs in the content so I can track where the scrapers post their content and connect the dots of scraper and destination. This can be slightly challenging with some that scramble the output or blend it with other content.
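The mechanics are roughly this (a sketch rather than my actual code; the token format and log file name are just for illustration):

    import logging
    import uuid

    logging.basicConfig(filename="trackbugs.log", level=logging.INFO)

    def tag_content(html, remote_ip, user_agent):
        # an innocuous, searchable token unique to this request
        token = "zq" + uuid.uuid4().hex[:10]
        logging.info("%s %s %s", token, remote_ip, user_agent)
        # tuck it where a lazy scraper will carry it along verbatim
        return html.replace("</body>", '<span id="%s"></span></body>' % token, 1)

A web search for the token later ties the scraper's destination back to the request that received it.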

Another interesting thing you can do if you want to trash a scraper is just spew out pages of profanity that will stop their output from being displayed to anyone with SafeSearch enabled.
8:16 pm on June 13, 2012 (gmt 0)
g1smd (Senior Member)


There's loads of things that can be done.

You can return random or doctored content; maybe minor changes that can be tracked, or completely different content that they can republish all they want because it's garbage. You can also play around with the actual status codes that are returned. It's not just 403 that can stop bots.
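For example (just a sketch; the particular codes are arbitrary), 410, 429 and 503 each discourage a well-behaved client in a different way:

    import random

    def bad_bot_reply():
        # rotate through statuses that a compliant crawler takes seriously
        return random.choice([
            ("410 Gone", []),
            ("429 Too Many Requests", [("Retry-After", "86400")]),
            ("503 Service Unavailable", [("Retry-After", "604800")]),
        ])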
8:41 pm on June 13, 2012 (gmt 0)
dstiles (Senior Member from GB)


Do you think an empty page with a 200 would annoy them? :)

Have to say, I like the idea of sending false data. Shame it's one more thing I don't have time to do. :(
9:05 pm on June 13, 2012 (gmt 0)
bakedjake (Administrator from CA)


Do you think an empty page with a 200 would annoy them? :)


There's many ways to annoy them. My favorite is tarpitting.
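A tarpit doesn't need to be elaborate. A bare sketch, assuming the bot has already been identified and its connection handed off:

    import socket
    import time

    def tarpit(conn: socket.socket, duration=600, delay=10):
        # drip harmless bytes so the connection stays open for ages
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        deadline = time.time() + duration
        try:
            while time.time() < deadline:
                conn.sendall(b"<!-- -->")
                time.sleep(delay)
        finally:
            conn.close()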

But I'm thinking more about taking advantage of their republishing of content, rather than just annoying them.
9:40 pm on June 13, 2012 (gmt 0)
lucy24 (Senior Member from US)


They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

Early this year, I met a robot so gloriously stupid, it tried to follow anything in the form <a \w+ = "{blahblah}". Clearly it never entered its robotic mind that anchors could go "class, name, id, href" in that order. So it spent a lot of time looking for nonexistent files like "/directory/outside" (anchor class) or "/directory/tag_1", "/directory/footnote_2", "/directory/pic_singiqtanga" (anchor names), or... Drat, I've forgotten the third.

:: detour to look ::

Fragments, those are the best of all! While normal humans dutifully go to "#footnote_1", the robot goes in search of "/directory/footnote_1". And with the aid of a simple rewrite, you could really send it there.

Gosh, this page looks just like /directory/outside and /directory/tag_1 and, and, and ... Maybe if I scrape the surface I'll find the difference.
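(For the curious: the "simple rewrite" was server configuration in my case, but the logic amounts to no more than this; the path pattern and decoy file are made up for the example.)

    import re

    # catch the fragment names the robot mistook for paths
    FRAGMENT_PATH = re.compile(r"^/directory/(footnote|tag|pic)_\w+$")

    def route(path):
        if FRAGMENT_PATH.match(path):
            with open("decoy.html", "rb") as f:
                return "200 OK", f.read()   # every fake path gets the same decoy
        return None                         # fall through to normal handling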
10:40 pm on June 13, 2012 (gmt 0)
Junior Member


Very apropos discussion. Over the past few days I have been in a discussion with the owner of a major crawler, and part of one of my several-page responses to him last night, on giving scrapers bad content, was:


Lots of possibilities.

Sending them on wild goose chases around the world and making sure that their database/link connections (their data for profit) get polluted and skewed with wild, imaginary “SEO connections” and other junk that the real internet's link structure never intended.

The trapped ones also end up in tar-pits that slow them down to pure randomized slow motion so I can watch them in peace doing their dance. Only “good bots” survive visits to those sites. I am just saying…

Some of the bad crawlers can be kept busy “scraping” for several days at a time, with not one stitch of valid information ending up in their databases for all that work.


But bakedjake, that is not skewing search engine results when the trapper sites are never seen by a real search engine, only by bad scrapers ignoring even basic instructions. It is merely a skew of the data for all the other fools that want to make a business out of scraping information and content from others, rather than doing something on their own. :)
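The goose-chase part is nothing exotic either; roughly this shape (everything here is invented for illustration): trap URLs that generate junk text and link only to more trap URLs.

    import hashlib
    import random

    WORDS = ["lorem", "ipsum", "widget", "gizmo", "synergy", "paradigm"]

    def maze_page(path):
        # the same path always yields the same junk, so it looks like a stable page
        seed = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        rng = random.Random(seed)
        text = " ".join(rng.choice(WORDS) for _ in range(300))
        links = " ".join('<a href="/maze/%d.html">more</a>' % rng.randrange(10 ** 6)
                         for _ in range(10))
        return "<html><body><p>%s</p><p>%s</p></body></html>" % (text, links)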
12:59 pm on June 15, 2012 (gmt 0)
Senior Member


On smaller sites, once they get into a trap, I usually have a routine that reads a 1 MB image file and spits out binary data with a few lines replaced by specific words that I can track later to see where it ends up.

Bandwidth is cheap, so for the first several requests they would get up to 15 MB of very readable stuff.

Then they're blanked by 403s.
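Roughly like this (a sketch; the file name and marker scheme are placeholders):

    def junk_payload(marker, source="big_image.jpg", size=1024 * 1024):
        # binary noise from a real file, stamped with trackable marker strings
        with open(source, "rb") as f:
            data = bytearray(f.read(size))
        stamp = marker.encode("ascii")
        for offset in range(0, len(data) - len(stamp), 100 * 1024):
            data[offset:offset + len(stamp)] = stamp
        return bytes(data)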

FUN!
5:30 pm on June 15, 2012 (gmt 0)
Senior Member


A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site - not the same content that appears on your site.

Not in the post-Penguin world; you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.
7:24 pm on June 15, 2012 (gmt 0)
wilderness (Senior Member)


Not in the post-Penguin world; you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.


Only if you're not aware of the difference between Google and Joe Schmo's bean-counter bot.
8:00 pm on June 15, 2012 (gmt 0)
incredibill (Administrator from US)


The problem you have screwing around with bad bots is that if the bad bots actually publish your foolishness and the good bots then index that foolishness, it can all come back to bite you in the butt.

A quick for-instance was a bright idea I had of tagging all links on a page with a code that IDs the source of the original page request. Fun for tracking humans using a TOR proxy, rotating IP pools like AOL, bots that crawl from multiple IPs, or simply where the data lands. Unfortunately, if you put it in the path you need to block all those paths in robots.txt, which Google WMT will bitch about being kept away from, and if you use a parameter instead then the SEs think it's a new page.
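The tagging itself is the easy part. A sketch (the parameter name and path scheme are invented) showing the two placements and their respective headaches:

    from urllib.parse import urlencode

    def tag_link(url, request_id, as_parameter=True):
        if as_parameter:
            # easy, but search engines treat every token as a brand-new page
            sep = "&" if "?" in url else "?"
            return url + sep + urlencode({"src": request_id})
        # path form: every such path then has to be blocked in robots.txt
        return url.rstrip("/") + "/t-" + request_id + "/"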

Just plan carefully is all I'm saying, or your little bit of fun could result in a trip to the burn unit.