Forum Moderators: open
User-Agent: "SMBot/1.1 (www.specificmedia.com)"
I'm a little ticked off. Specific Media has been using SMBot to rip complete web sites, repeatedly. In the last couple of days it hit two of my sites several hundred times each. Specific Media uses this bot to data-mine your site for information that their advertisers and advertising network can use. They are making a profit off your hard work, without your permission of course. Can someone say Digital Millennium Copyright Act violations? SMBot completely disregards the robots.txt standard as well. At first their bot was crawling around without a user agent. (Fellow bot-hunter IncrediBILL has more info on SMBot.)
[edited by: volatilegx at 8:59 pm (utc) on Jan. 9, 2007]
[edit reason] trimmed post to remove call to action [/edit]
Suggest changing your lines to a simple denial of access.
# Return 403 Forbidden to any user agent starting with "SMBot" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^SMBot [NC]
RewriteRule .* - [F]
Jim (as I recall) has an alternative that sends them off to a page which uses fewer kilobytes than even a simple 403 response.
Don
For instance, use a nonsense code to search for, like "AAVVQQAA", plus a key that links the hit back to the crawling event. Hyphenate the code so the search engine sees the first part as a uniquely searchable word; the result looks like "AAVVQQAA-12276021092", or however you choose to build it.
Looks kind of like a part # when it's assembled :)
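A rough sketch in Python of how such a token could be generated (the function name and the 8-letter prefix length are my own choices, not anything the poster specified):

```python
import random
import string

def make_tracking_token(event_id: str) -> str:
    """Build a unique, searchable marker of the form CODE-EVENTID.

    The alpha prefix is a nonsense string that search engines index as
    a word on its own; the hyphen keeps it separately searchable from
    the numeric key that ties the hit back to a specific crawl event.
    """
    prefix = "".join(random.choices(string.ascii_uppercase, k=8))
    return f"{prefix}-{event_id}"

# Example: key the token to a crawl-event id you log server-side
token = make_tracking_token("12276021092")
```

Searching the engines later for the prefix alone finds every page carrying the planted code; the numeric half tells you which crawl event leaked it.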
I never show breadcrumbs like these openly; they're cloaked from the rest of the world so I can track my data. I use CSS to hide them in the browser, so people never see them, but the crawlers strip out the HTML and, voila, they are exposed on the scrapers' websites.
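A minimal sketch of the idea, assuming a simple inline-style approach (the poster doesn't share his actual markup, and later notes he has variants that avoid CSS entirely):

```python
def hidden_breadcrumb(token: str) -> str:
    """Wrap a tracking token in markup that CSS hides from browsers.

    Human visitors never see the span, but a scraper that strips or
    ignores the stylesheet republishes the token in plain view.
    """
    return f'<span style="display:none">{token}</span>'

snippet = hidden_breadcrumb("AAVVQQAA-12276021092")
```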
[edited by: encyclo at 1:12 am (utc) on Jan. 12, 2007]
[edit reason] fixed typo per request [/edit]
Please always remember it's not just white hats who read this forum! :)
I think posting it in public will ultimately be self-defeating.
Gary,
Many of us are in agreement along these lines; however, closing our doors and insights to others makes it quite impossible to share something that most webmasters do not utilize. Seems a do-or-die dilemma?
This is a good example of my first awareness of such monitoring of WebmasterWorld, some three years ago.
[webmasterworld.com...]
Personally, I'm unable to recall when a bad bot was last able to spider my entire sites successfully, nor am I aware of it being done in a cloaked and/or unidentified manner.
I'm not saying that complete spidering doesn't still occur, only that rewrites in .htaccess have been a successful deterrent on my sites.
Should we close our doors to newcomers willing to learn and create a "good ole boys" method of communication?
Don
I'm a conflicted man. No doubt about it.
Gary,
Jim's in charge of the dispensary ;)
[webmasterworld.com...]
If no comfort there?
Try here:
[herbalrescue.co.nz...]
If no comfort either location?
try some regular "herb" ;)
Don
I think posting it in public will ultimately be self-defeating
Gary, what was for YOUR EYES ONLY was the specific codes I'm using, that's still just for you! :)
Early on I was just experimenting with the tactic, and I have since shown the results of how this works at both SES and PubCon, so the cat is out of the proverbial bag. I also came up with other ways of implementing or randomizing it, including methods that don't use CSS, so that it's virtually impossible for scrapers to code around this technique.
For instance, you can even embed a specific visible phrase and use it to track content such as "aardvark and centipede farts" which currently return no results. You could make it completely visible in small type on the page such as "Silly factoid#12276021092: Did you know that the aardvark and centipede farts?"
Then just hit the SE's looking for the exact phrase "aardvark and centipede farts" and "silly factoid" and sure enough the related code pops out.
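The planted sentence can be assembled mechanically; here's a sketch (the wording follows the poster's example, the helper name is mine):

```python
def silly_factoid(code: str, phrase: str) -> str:
    """Compose a visible sentence carrying a unique search phrase
    plus the tracking code, disguised as harmless page filler."""
    return f"Silly factoid#{code}: Did you know that the {phrase}?"

factoid = silly_factoid("12276021092", "aardvark and centipede farts")
```

The phrase must be something with zero existing search results, so any hit on it is guaranteed to be your planted content.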
The only problem I've run into since busting some scrapers is that they took a page out of my book and started using NOARCHIVE so you can't snoop their cloaked pages in search engine cache. The solution to this problem was to bind the session ID code to a specific word so that Google will display the word plus the session code in the snippet. That's why I always have a phrase plus a keyword so that if the scraper scrambles the content, which many do, I can pull it back together in the results.
So what next, they start looking for long numbers and filter them out?
Fine, I can switch to HEX or BASE36, or a completely alpha variant so it will look like a word instead of a number.
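Base-36 is trivial to produce; a sketch of an encoder (my own implementation of the standard conversion, not the poster's code):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 0-9 then a-z

def to_base36(n: int) -> str:
    """Encode a numeric session id in base 36 so the tracking token
    reads like a word instead of a conspicuous long number."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

encoded = to_base36(12276021092)
```

Python's built-in `int(encoded, 36)` decodes it back, so the crawl-event lookup still works on your end.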
Besides, scrapers are like rats, the smart ones take your cheese and leave an empty trap, but we can still enjoy snaring all the stupid ones while it lasts.
[edited by: incrediBILL at 12:07 am (utc) on Jan. 12, 2007]
Scrapers now know to check for CSS that makes an element seem invisible.
Maybe it's a font tag changing the color, as opposed to CSS
Maybe I embedded it in a 1 pixel iFrame lurking in the page
Maybe I faked the code honeypot-style into a URL designed for scrapers to follow to a special page, so I can search for anyone that links to "you_stupid_scraper_12276021092.html"
Maybe the code is hidden in the title as "PAGE 122760 of 21092".
Maybe it's all of the above!
You never know what I'll do next ;)
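The title trick above, for instance, is just a split-and-rejoin; a sketch, assuming a fixed split position (the poster's "PAGE 122760 of 21092" splits the code 6 digits in):

```python
def split_for_title(code: str, cut: int = 6) -> str:
    """Disguise a tracking code as innocent page numbering
    suitable for a <title> tag."""
    return f"PAGE {code[:cut]} of {code[cut:]}"

def rejoin_from_title(title: str) -> str:
    """Recover the original code from the disguised title."""
    return "".join(title.replace("PAGE ", "").split(" of "))

title = split_for_title("12276021092")  # "PAGE 122760 of 21092"
```

A scraper filtering for long numbers sees two short, plausible page numbers; you see a title you can reassemble and search for.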
The point is that there are many tricks and this is just scraping (pun intended) the surface.
[edited by: incrediBILL at 12:08 am (utc) on Jan. 12, 2007]