
Search Engine Spider and User Agent Identification Forum

    
Playing Around With Bad Bots
bakedjake

Msg#: 4464941 posted 1:57 pm on Jun 13, 2012 (gmt 0)

I used to give this counter-intelligence presentation at a few of the conferences.

I talked about the idea of giving your competitors false information - for example, when you know your competitors are visiting and/or scraping your site, perhaps you adjust the prices for them.

Sometimes the idea isn't to outright ban bad bots completely.

Often, bad bots are scrapers designed to republish your content.

Many bad bots and republishers aren't all that sophisticated. They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site. Not the same content that appears on your site.
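For illustration only, a minimal sketch of that idea in Python/Flask (the user-agent list, route, and decoy text are hypothetical, not anything described in this thread): feed a detected scraper a decoy page whose backlinks are written so a naive exact match on the string "<a href=" never finds them.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical user-agent fragments you've already identified as scrapers.
SCRAPER_UAS = ("BadBot", "ContentGrabber")

# Decoy article: not the real page, but it carries links back to the real site.
# Because another attribute comes before href, a stripper doing an exact match
# on the literal string '<a href=' never sees this link.
DECOY = (
    '<p>Totally original widget article. Source: '
    '<a class="src" href="https://www.example.com/widgets">example.com/widgets</a>.</p>'
)

@app.route("/articles/<slug>")
def article(slug):
    ua = request.headers.get("User-Agent", "")
    if any(bad in ua for bad in SCRAPER_UAS):
        return DECOY                    # scrapers republish this, links intact
    return real_article(slug)           # humans and good bots get the real page

def real_article(slug):
    return f"<p>The real article for {slug} lives here.</p>"
```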

Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.

It's just a theory.

 

incrediBILL

Msg#: 4464941 posted 8:09 pm on Jun 13, 2012 (gmt 0)

Mind you, I would never use this tactic for any reason whatsoever, as my expansive moral code prevents me from ever manipulating search engine results for any reason at all.


I've ROTFLMAO and I can't get up.

But you are correct, the scrapers can be messed with in all sorts of fun ways.

I put data tracking bugs in the content so I can track where the scrapers post their content and connect the dots of scraper and destination. This can be slightly challenging with some that scramble the output or blend it with other content.
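A rough sketch of one way such a tracking bug might work (the secret, token format, and placement are assumptions, not incrediBILL's actual method): derive a short token from the requesting IP, weave it into the page, and later search the web for the token to connect scraper and destination.

```python
import hashlib
import hmac

SECRET = b"change-me"   # hypothetical server-side secret

def tracking_token(client_ip: str) -> str:
    """Short, innocuous-looking token tied to the requesting IP."""
    digest = hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()
    return digest[:10]

def bug_content(html: str, client_ip: str) -> str:
    """Append the token where a human won't notice but a search engine will index it."""
    token = tracking_token(client_ip)
    return html + f'\n<p class="ref">Ref. {token}</p>'

# Log ip -> token at serve time; when the token later turns up on a scraper
# site, the log connects the dots between scraper IP and destination.
```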

Another interesting thing you can do if you want to trash a scraper is just spew out pages of profanity that will stop their output from being displayed to anyone with SafeSearch enabled.

g1smd

Msg#: 4464941 posted 8:16 pm on Jun 13, 2012 (gmt 0)

There's loads of things that can be done.

You can return random or doctored content; maybe minor changes that can be tracked, or completely different content that they can republish all they want because it's garbage. You can also play around with the actual status codes that are returned. It's not just 403 that can stop bots.
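A small sketch of the status-code idea (the particular codes and the empty 200 below are just examples): rotate flagged bots through several plausible responses instead of a predictable 403.

```python
import random
from flask import Response

# (status, body, extra headers) combinations to hand out to flagged bots.
BOT_RESPONSES = [
    (410, "", {}),                              # Gone: the page "no longer exists"
    (429, "", {"Retry-After": "86400"}),        # rate limited: come back tomorrow
    (503, "", {"Retry-After": "3600"}),         # "temporary" outage
    (200, "<html><body></body></html>", {}),    # success code, empty page
]

def confuse_bot() -> Response:
    status, body, headers = random.choice(BOT_RESPONSES)
    return Response(body, status=status, headers=headers)
```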

dstiles

Msg#: 4464941 posted 8:41 pm on Jun 13, 2012 (gmt 0)

Do you think an empty page with a 200 would annoy them? :)

Have to say, I like the idea of sending false data. Shame it's one more thing I don't have time to do. :(

bakedjake

Msg#: 4464941 posted 9:05 pm on Jun 13, 2012 (gmt 0)

Do you think an empty page with a 200 would annoy them? :)


There's many ways to annoy them. My favorite is tarpitting.

But I'm thinking more about taking advantage of their republishing of content, rather than just annoying them.
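For reference, a minimal tarpit sketch (not jake's actual setup; the chunk size and delay are arbitrary): stream the response a few bytes at a time so each request keeps the bot's connection tied up for minutes while costing the server almost nothing.

```python
import time
from flask import Response

def tarpit(page_html: str, delay: float = 5.0, chunk: int = 16) -> Response:
    """Drip-feed a flagged bot; cheap to serve with a threaded or async server."""
    def drip():
        for i in range(0, len(page_html), chunk):
            yield page_html[i:i + chunk]   # a trickle of real-looking bytes...
            time.sleep(delay)              # ...then a long pause
    return Response(drip(), mimetype="text/html")
```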

lucy24

Msg#: 4464941 posted 9:40 pm on Jun 13, 2012 (gmt 0)

They will strip links from your content, but they will only do so at a very basic level (say, an exact string match for "<a href=" or so).

Early this year, I met a robot so gloriously stupid, it tried to follow anything in the form <a \w+ = "{blahblah}". Clearly it never entered its robotic mind that anchors could go "class, name, id, href" in that order. So it spent a lot of time looking for nonexistent files like "/directory/outside" (anchor class) or "/directory/tag_1", "/directory/footnote_2", "/directory/pic_singiqtanga" (anchor names), or... Drat, I've forgotten the third.

:: detour to look ::

Fragments, those are the best of all! While normal humans dutifully go to "#footnote_1", the robot goes in search of "/directory/footnote_1". And with the aid of a simple rewrite, you could really send it there.

Gosh, this page looks just like /directory/outside and /directory/tag_1 and, and, and ... Maybe if I scrape the surface I'll find the difference.
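One way to "really send it there", sketched in Python/Flask (the path pattern and trap text are hypothetical): catch the phantom URLs the bot invents from anchor names and fragments and route them all to a junk page instead of a 404.

```python
import re
from flask import Flask

app = Flask(__name__)

# Names that only exist as anchor classes/names/fragments, never as real pages.
PHANTOM = re.compile(r"^(footnote|tag|pic)_\w+$")

@app.route("/directory/<name>")
def maybe_trap(name):
    if PHANTOM.match(name):
        # Only a robot mis-parsing anchors ever asks for these.
        return "<p>Congratulations, robot. Enjoy this page of nothing.</p>"
    return real_page(name)

def real_page(name):
    return f"<p>Real content for {name}.</p>"
```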

DeeCee



 
Msg#: 4464941 posted 10:40 pm on Jun 13, 2012 (gmt 0)

Very apropos discussion. Over the past few days I've been having a discussion with the owner of a major crawler, and part of one of my several-page responses to him last night, on giving scrapers bad content, was:


Lots of possibilities.

Sending them on wild goose chases around the world and making sure that their database/link connections (their data for profit) get polluted and skewed with wild, imaginary “SEO connections” and other junk that was never part of the real link structure of the internet.

The trapped ones also end up in tar-pits that slow them down to pure randomized slow motion so I can watch them in peace doing their dance. Only “good bots” survive visits to those sites. I am just saying…

Some of the bad crawlers can be kept busy “scraping” for several days at a time, with not one stitch of valid information ending up in their databases for all that work.


But bakedjake, that is not skewing search engine results when the trapper sites are never seen by a real search engine, only by bad scrapers ignoring even basic instructions. It merely skews the data for all the other fools who want to make a business out of scraping information and content from others rather than doing something on their own. :)

blend27

Msg#: 4464941 posted 12:59 pm on Jun 15, 2012 (gmt 0)

On smaller sites, once they get into a trap, I usually have a routine that reads a 1 MB image file and spits out the binary data with a few lines replaced by specific words that I can track later to see where it ends up.

Bandwidth is cheap, so for the first several requests they get up to 15 MB of very readable stuff.

Then they get nothing but 403s.
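A rough sketch of that routine as described (file name, marker word, and the request threshold are hypothetical, not blend27's actual code):

```python
from flask import Flask, Response, abort, request

app = Flask(__name__)

MARKER = b" wxyzzy-trace-1138 "        # a word you can search for later
hits_by_ip: dict = {}                  # naive in-memory request counter

@app.route("/trap/<name>")
def trap(name):
    ip = request.remote_addr or "?"
    hits_by_ip[ip] = hits_by_ip.get(ip, 0) + 1
    if hits_by_ip[ip] > 15:                        # first several requests get data...
        abort(403)                                 # ...then nothing but 403s
    with open("decoy.jpg", "rb") as f:             # roughly 1 MB of image bytes
        blob = f.read()
    salted = blob.replace(b"\n", MARKER, 5)        # plant the trackable words
    return Response(salted, mimetype="application/octet-stream")
```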

FUN!

1script

Msg#: 4464941 posted 5:30 pm on Jun 15, 2012 (gmt 0)

A theoretical tactic that you could use to leverage bad bots is to distribute content that links back to your site. Not the same content that appears on your site.

Not in the post-Penguin world: you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.

wilderness

Msg#: 4464941 posted 7:24 pm on Jun 15, 2012 (gmt 0)

Not in the post-Penguin world: you can easily negative-SEO yourself into a ban (pardon, "manual action") if you start feeding your links to bad sites like that.


Only if you're not aware of the difference between Google and Joe Schmo's bean counter bot.

incrediBILL

Msg#: 4464941 posted 8:00 pm on Jun 15, 2012 (gmt 0)

The problem with screwing around with bad bots is that if the bad bots actually publish your foolishness and the good bots then index it, it can all come back to bite you in the butt.

A quick for-instance was a bright idea I had of tagging all links on a page with a code that IDs the source of the original page request. Fun for tracking humans using a TOR proxy, or rotating IP pools like AOL, or bots that crawl from multiple IPs, or simply tracking where the data lands. Unfortunately, if you put it in the path you need to block all those paths in robots.txt, which Google WMT will bitch about being kept away from, and if you use a parameter instead then the SEs think it's a new page.
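A sketch of the parameter flavour of that idea (the parameter name and helper are made up): rewrite internal hrefs to carry a per-request source code, accepting that you then need rel=canonical or a parameter exclusion so the engines don't treat every tagged URL as a new page.

```python
import re

def tag_links(html: str, source_id: str) -> str:
    """Append ?src=<source_id> (or &src=) to every internal href."""
    def add_param(match):
        url = match.group(1)
        sep = "&" if "?" in url else "?"
        return f'href="{url}{sep}src={source_id}"'
    return re.sub(r'href="(/[^"]*)"', add_param, html)

# Example: tag_links(page_html, "tor-exit-42") marks every internal link in
# this particular response, so a scraped copy betrays which request it came from.
```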

Just plan carefully is all I'm saying, or your little bit of fun could result in a trip to the burn unit.
