Search Engine Spider and User Agent Identification Forum

    
SMBot
Crawling from Amazon.com in South Africa
GaryK




msg:3199047
 8:44 pm on Dec 24, 2006 (gmt 0)

SMBot/1.1 (www.specificmedia.com)
216.182.230.228
domU-12-31-33-00-03-2E.usma1.compute.amazonaws.com
-----
OrgName: Amazon.com, Inc.
OrgID: AMAZO-4
Address: Amazon Development Centre South Africa

Did not request robots.txt. Not sure if it needs to. It appears to be related to affiliate advertising.

 

GaryK




msg:3199516
 8:12 pm on Dec 25, 2006 (gmt 0)

Update: Bill commented about this on his blog.

hybrid6studios




msg:3212919
 5:11 am on Jan 9, 2007 (gmt 0)

This bot is worse than you think.

User-Agent: "SMBot/1.1 (www.specificmedia.com)"

I'm a little ticked off. Specific Media has been using SMBot to rip complete web sites, repeatedly. In the last couple of days it hit 2 of my sites several hundred times each. Specific Media is using this bot to data-mine your site for information that can be used by their advertisers and advertising network. They are making a profit off your hard work! Without your permission, of course. Can someone say Digital Millennium Copyright Act violations? SMBot completely disregards the robots.txt standard as well. At first their bot was crawling around without a user agent. (Fellow bot-hunter IncrediBILL has more info on SMBot.)

[edited by: volatilegx at 8:59 pm (utc) on Jan. 9, 2007]
[edit reason] trimmed post to remove call to action [/edit]

wilderness




msg:3212969
 6:59 am on Jan 9, 2007 (gmt 0)

hybrid,
As previously suggested, redirecting bots is a BAD idea.

Plain old denial is much more effective.

hybrid6studios




msg:3213305
 1:40 pm on Jan 9, 2007 (gmt 0)

wilderness...please elaborate.

wilderness




msg:3213342
 2:31 pm on Jan 9, 2007 (gmt 0)

Redirecting "bad" bots to either their own websites, another's website, or alternative images is a bad practice.

Any attempt at redirecting bad bots simply causes more problems for you in the future.

Suggest changing your lines to a simple denial of access:

# Return 403 Forbidden to any request whose user agent starts with "SMBot"
RewriteCond %{HTTP_USER_AGENT} ^SMBot [NC]
RewriteRule .* - [F]

Jim (as I recall) has an alternative that sends them off to a page which uses fewer kilobytes than even a simple 403.

Don

hybrid6studios




msg:3213751
 7:26 pm on Jan 9, 2007 (gmt 0)

Hmmm...you may be right...It's just getting ridiculous with these guys though. Hammering people's web sites with a bad bot is a bad practice too.

I'll revert to a 403. I'd be interested in Jim's alternative method.

incrediBILL




msg:3213765
 7:32 pm on Jan 9, 2007 (gmt 0)

I don't send them away and I don't send them 403s either.

I feed them a page with breadcrumbs so I can see where the data shows up if it's ever indexed by a search engine.

hybrid6studios




msg:3213979
 10:49 pm on Jan 9, 2007 (gmt 0)

Interesting...what do you mean by breadcrumbs?

incrediBILL




msg:3214385
 7:56 am on Jan 10, 2007 (gmt 0)

Breadcrumbs - a piece of data so unique that it has NO search results and contains a session ID that ties the data back to the original IP and user agent.

For instance, a nonsense code like "AAVVQQAA" plus a key that links it back to the crawling event. Hyphenate the code so the search engine sees the first part as uniquely searchable; it would look like "AAVVQQAA-12276021092" or however you do it.

Looks kind of like a part # when it's assembled :)

I never show breadcrumbs to search engines, just cloaked to the rest of the world to track my data.

I use CSS to hide them in the browser, so people never see these, but the crawlers strip out the HTML and VOILA! they are exposed on the scrapers' websites.

[edited by: encyclo at 1:12 am (utc) on Jan. 12, 2007]
[edit reason] fixed typo per request [/edit]
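A minimal sketch of the breadcrumb idea described above, in Python. The token format, log fields, and hiding CSS here are illustrative assumptions, not incrediBILL's actual code:

# Hypothetical breadcrumb generator: a nonsense prefix plus a session key,
# logged against the requesting IP and user agent so a later search-engine
# hit on the prefix can be traced back to the original crawl event.
import random
import string
import time

def make_breadcrumb(ip, user_agent, log):
    prefix = "".join(random.choices(string.ascii_uppercase, k=8))  # e.g. "AAVVQQAA"
    session_id = str(int(time.time() * 100))                       # e.g. "12276021092"
    token = f"{prefix}-{session_id}"                               # looks like a part number
    log[token] = {"ip": ip, "user_agent": user_agent, "ts": time.time()}
    return token

def hidden_html(token):
    # Browsers hide the span; naive scrapers strip the HTML and expose the token.
    return f'<span style="display:none">{token}</span>'

crumb_log = {}
token = make_breadcrumb("216.182.230.228", "SMBot/1.1 (www.specificmedia.com)", crumb_log)
page_fragment = hidden_html(token)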

volatilegx




msg:3214730
 3:03 pm on Jan 10, 2007 (gmt 0)

> I never show breadcrumbs to search engines, just cloaked to the rest of the world to track my data.

Yet another fantastic reason to cloak! ;)

hybrid6studios




msg:3215243
 8:51 pm on Jan 10, 2007 (gmt 0)

IncrediBILL: Brilliant. :)

volatilegx: Yes, very true. I'd say that's definitely a valid use.

GaryK




msg:3216583
 9:06 pm on Jan 11, 2007 (gmt 0)

When I initially learned about this it was marked "for your eyes only." I think posting it in public will ultimately be self-defeating. Scrapers now know to check for CSS that makes an element seem invisible. From there it's not a lot of work to strip out anything that's invisible.

Please always remember it's not just white hats who read this forum! :)

wilderness




msg:3216614
 9:27 pm on Jan 11, 2007 (gmt 0)

I think posting it in public will ultimately be self-defeating.

Gary,
Many of us are in agreement along these lines; however, closing our doors and insights to others makes it quite impossible to share something that most webmasters do not utilize. Seems a do-or-die dilemma?

This is a good example of my first awareness of such monitoring of WebmasterWorld some three years ago.

[webmasterworld.com...]

Personally, I'm unable to recall when a bad bot was able to spider my entire sites successfully. Nor am I aware of it (spidering) being done in a cloaked and/or unidentified manner.

I'm not saying that complete spidering doesn't still occur, only that rewrites in .htaccess have been a successful deterrent on my sites.

Should we close our doors to newcomers willing to learn and create a "good ole boys" method of communication?

Don

GaryK




msg:3216662
 10:10 pm on Jan 11, 2007 (gmt 0)

Should we close our doors to newcomers willing to learn and create a "good ole boys" method of communication?

No of course not. :)

Still, I feel so exposed when information like this becomes common knowledge.

I'm a conflicted man. No doubt about it. ;)

wilderness




msg:3216671
 10:16 pm on Jan 11, 2007 (gmt 0)

I'm a conflicted man. No doubt about it.

Gary,
Jim's in charge of the dispensary ;)

[webmasterworld.com...]

If no comfort there?
Try here:
[herbalrescue.co.nz...]

If no comfort at either location?
Try some regular "herb" ;)

Don

GaryK




msg:3216680
 10:26 pm on Jan 11, 2007 (gmt 0)

I could have stopped with Jim. I loved that thread. But since you offered so many choices I think I'll go directly from the top of the list to the bottom. ;)

incrediBILL




msg:3216738
 11:29 pm on Jan 11, 2007 (gmt 0)

I think posting it in public will ultimately be self-defeating

Gary, what was for YOUR EYES ONLY were the specific codes I'm using, and those are still just for you! :)

Early on I was just experimenting with the tactic and have since shown the results of how this works at both SES and PubCon, so the cat is out of the proverbial bag. I also came up with other ways of implementing or randomizing it, including not using CSS, so that it's virtually impossible for scrapers to code for this technique.

For instance, you can even embed a specific visible phrase and use it to track content, such as "aardvark and centipede farts", which currently returns no results. You could make it completely visible in small type on the page, such as "Silly factoid #12276021092: Did you know that the aardvark and centipede farts?"

Then just hit the SE's looking for the exact phrase "aardvark and centipede farts" and "silly factoid" and sure enough the related code pops out.

The only problem I've run into since busting some scrapers is that they took a page out of my book and started using NOARCHIVE so you can't snoop their cloaked pages in search engine cache. The solution to this problem was to bind the session ID code to a specific word so that Google will display the word plus the session code in the snippet. That's why I always have a phrase plus a keyword so that if the scraper scrambles the content, which many do, I can pull it back together in the results.

So what next, they start looking for long numbers and filter them out?

Fine, I can switch to HEX or BASE36, or a completely alpha variant so it will look like a word instead of a number.

Besides, scrapers are like rats, the smart ones take your cheese and leave an empty trap, but we can still enjoy snaring all the stupid ones while it lasts.

[edited by: incrediBILL at 12:07 am (utc) on Jan. 12, 2007]
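A rough Python illustration of the hex, base-36, and letters-only variants mentioned above; the encoding helper is an assumption for illustration, not the actual scheme:

# Disguise the numeric session code so it no longer looks like a long number.
# The digit alphabets below are illustrative assumptions.
import string

def to_base(n, digits):
    # Encode a non-negative integer using the given digit alphabet.
    if n == 0:
        return digits[0]
    out = []
    base = len(digits)
    while n:
        n, r = divmod(n, base)
        out.append(digits[r])
    return "".join(reversed(out))

session = 12276021092
print(f"{session:x}")                                             # hex: 2dbb53764
print(to_base(session, string.digits + string.ascii_lowercase))   # base 36: 5n0tqhw
print(to_base(session, string.ascii_lowercase))                   # letters only, reads like a word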

incrediBILL




msg:3216760
 11:42 pm on Jan 11, 2007 (gmt 0)

Scrapers now know to check for CSS that makes an element seem invisible.

Maybe it's a font tag changing the color as opposed to CSS
Maybe I embedded it in a 1-pixel iFrame lurking in the page
Maybe I faked the code honeypot-style into a URL designed for scrapers to follow to a special page, so I can search for anyone that links to "you_stupid_scraper_12276021092.html"
Maybe the code is hidden in the title as "PAGE 122760 of 21092".
Maybe it's all of the above!

You never know what I'll do next ;)

The point is that there are many tricks and this is just scraping (pun intended) the surface.

[edited by: incrediBILL at 12:08 am (utc) on Jan. 12, 2007]
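A hedged sketch of randomizing where the code is planted, along the lines of the variations listed above; the markup fragments are assumptions for illustration only:

# Pick one of several carriers for the tracking code at page-build time,
# so there is no single pattern for a scraper to filter out.
import random

def embed_token(token, background="#ffffff"):
    variants = [
        # Old-school font tag matching the background colour, no CSS involved.
        f'<font color="{background}">{token}</font>',
        # A 1-pixel iframe lurking in the page.
        f'<iframe src="/t/{token}.html" width="1" height="1"></iframe>',
        # A honeypot-style link only a scraper would follow and republish.
        f'<a href="/you_stupid_scraper_{token}.html"></a>',
        # The code split across a fake pagination string in the title.
        f'<title>PAGE {token[:6]} of {token[6:]}</title>',
    ]
    return random.choice(variants)

print(embed_token("12276021092"))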

hybrid6studios




msg:3216832
 1:12 am on Jan 12, 2007 (gmt 0)

Again, brilliant. Thank you. I'm currently refining my anti-bot strategy and will be working on a unique implementation of that concept.

GaryK




msg:3216837
 1:14 am on Jan 12, 2007 (gmt 0)

You never know what I'll do next

You and I both know how true that is. ;)

hybrid6studios




msg:3216864
 1:35 am on Jan 12, 2007 (gmt 0)

Bill, Don, Gary, Dan: I know I'm new to this forum, but thank you all for your open advice...You guys rock. I'm already a pretty good bot-fighter, but there is a ton more to learn and you guys have been at it longer. It seems like the more I learn the more I realize there is to know. :) Again, thanks.

volatilegx




msg:3217498
 5:14 pm on Jan 12, 2007 (gmt 0)

the more I learn the more I realize there is to know

The sign of a wise /(wo)?man/.

(sorry for the regex, couldn't resist)

hybrid6studios




msg:3217654
 8:08 pm on Jan 12, 2007 (gmt 0)

LOL. :)
