
Search Engine Spider and User Agent Identification Forum

This 42-message thread spans 2 pages (page 2 shown).
What's the Best Way to Keep All Spiders/Bots Out?
Only want a couple of pages crawled
inbound - msg:4369766 - 3:34 pm on Oct 2, 2011 (gmt 0)

I'm launching a new site that I don't want in any search engine (apart from a few static pages) - and I certainly don't want people running bots against it (as each page will have one/several BOSS API calls, at a cost to me).

I'm aware of several techniques such as:

Rate limiting
Honey Pots
IP Banning
UA Banning

But there's a very specific issue: people are likely to want to "crawl" the site by doing lots of search queries rather than link-based crawling. Another thing (that may help a bit) is that users are going to be UK only - but I might want to allow US, CA, IE, AU, NZ usage too (mainly so webmasters/press can try it and write about it).

Effectively, it's the same issue that search engines have, and I'm unsure of the best way to deal with several things (such as IP addresses that have many simultaneous users - is AOL still like that?). I don't want to use ANY feature that could be seen as a privacy issue (so no cookie dropping without permission - although I'd be happy to use a short-lived server-side identifier).
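
To make the rate-limiting option concrete, here is a minimal sketch keyed on an IP or a short-lived server-side identifier; the function name and the thresholds (MAX_REQUESTS, WINDOW_SECONDS) are illustrative assumptions, not anything prescribed in the thread:

```python
# A minimal sliding-window rate limiter (sketch only; limits are assumptions).
import time
from collections import defaultdict, deque

MAX_REQUESTS = 30      # assumed: requests/searches allowed per window
WINDOW_SECONDS = 60    # assumed: window length in seconds

_history = defaultdict(deque)   # visitor key -> timestamps of recent requests

def allow_request(visitor_key: str) -> bool:
    """True if this visitor (IP or short-lived server-side ID) is under the limit."""
    now = time.time()
    timestamps = _history[visitor_key]
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()            # drop hits outside the sliding window
    if len(timestamps) >= MAX_REQUESTS:
        return False                    # over the limit: serve a 403/429 instead
    timestamps.append(now)
    return True
```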

I'm happy to read as much as required (so links to older, but still valid, threads would be handy too).

 

moTi - msg:4370682 - 6:28 pm on Oct 4, 2011 (gmt 0)

you don't need all that crap, guys. you need to be creative!

just store the visitor ip in database 1. then use a javascript that loads an iframe on your page, which saves the ip of the visitor in database 2.

so database 1 is every visitor including bots, whereas database 2 is only real human visitors.

now if something visits your website, check whether it has already shown up in database 1 (within a certain period). do that, because if it's a repeat visitor and a human, you can be sure that the iframe on your page has definitely loaded at least once.

if it is a repeat visitor, then check database 2. if its ip is not in database 2 (within the same period), it's a bot. if it's a bot, send a 403. can you dig that?

to make it 100% foolproof for your human visitors, you should also add the iframe to your 403 error page if something goes wrong.

so the only disadvantage of this approach is that you have to live with the one single pageview a bot generates before you can determine it's a bot. you can also unblock certain bots you want to crawl your pages via reverse dns lookup.

works like a charm for me.
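
For what it's worth, here is a rough, in-memory sketch of that two-database idea. The 24-hour PERIOD, the handler names and the dict storage are assumptions for illustration, not moTi's actual implementation:

```python
# In-memory sketch of the two-database check (assumed names and a 24h PERIOD;
# a real setup would presumably use database tables, not dicts).
import time

PERIOD = 24 * 3600   # "a certain period" from the post - 24 hours is an assumption
seen_any = {}        # "database 1": last-seen time for every request, bots included
seen_human = {}      # "database 2": last-seen time for IPs that loaded the JS iframe

def handle_beacon(ip: str) -> None:
    """Endpoint behind the javascript-loaded iframe: only real browsers reach it."""
    seen_human[ip] = time.time()

def handle_page(ip: str) -> int:
    """Return the HTTP status to serve for a normal page request."""
    now = time.time()
    repeat_visitor = ip in seen_any and now - seen_any[ip] < PERIOD
    confirmed_human = ip in seen_human and now - seen_human[ip] < PERIOD
    seen_any[ip] = now            # record this hit in database 1 either way
    if repeat_visitor and not confirmed_human:
        return 403                # repeat visitor that never ran the iframe: a bot
    return 200                    # first-time visitors get the benefit of the doubt

# Per the post, the 403 page should also include the iframe so a wrongly
# blocked human lands in database 2 on their next request.
```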

wilderness - msg:4370704 - 7:04 pm on Oct 4, 2011 (gmt 0)

so the only disadvantage of this approach is,


Wrong!

The other disadvantage is that your method is more server/CPU intensive (and likely even slower).

dstiles - msg:4370728 - 8:18 pm on Oct 4, 2011 (gmt 0)

I have mentioned this several times in various forums hereabouts, including this one, and been largely ignored:

1. Not all people browse with javascript/ajax turned on - it's dangerous and the default in NoScript and similar is to disable it until requested by the user. Same applies to a lesser extent with frames, especially iframes. Many people also turn off or ignore cookies, in case anyone thinks of that as a solution (and some bots accept them anyway).

2. Bots can very easily ignore javascript and (i)frames and very few, I think, read javascript files or follow javascript links unless they are looking for email addresses (which a lot of people hide with javascript and are surprised when bot-scraped).

Wilderness - I assume that is only a subset of the UK IPs - there are many groups including subsets in the 217/8, 46/8 and others. For anyone using your list, note that two or three escape codes are incorrect.

inbound - I doubt most scrapers look for references to themselves. They are far more interested in creating content copies of sites that rank well in google - one of google's many contributions to the reasons for web scraping and spam. There are several reasons for site-scraping but I really do not think narcissism is seriously one of them.

inbound - msg:4370736 - 8:29 pm on Oct 4, 2011 (gmt 0)

You've been provided multiple solutions by longtime participants in this forum, and yet, rather than tackling the work and solutions, you're still looking for a one-shot copy-and-paste solution where no such thing exists.


I think you may have misinterpreted my intentions, probably due to me not being able to adequately explain myself. I am very grateful for the solutions/suggestions, and by no means am I looking for anyone to fully "solve" the issue for me. I know there will be work on my part, mainly in testing efficacy and the resources required.

I suppose a better way to ask my previous question is this:

If I do all of the above, is there any point in doing more?

I suspect the answer to that is no, but (and here's where I'll do a bad job of describing things again) I'm also asking if there's any point in offering an API to those that want to get to certain (limited) data so that they don't resort to scraping.

wilderness - msg:4370754 - 9:02 pm on Oct 4, 2011 (gmt 0)

Wilderness - I assume that is only a subset of the UK IPs - there are many groups including subsets in the 217/8, 46/8 and others. For anyone using your list, note that two or three escape codes are incorrect.


Their first 28 lines converted to RegEx (I'm not about to go through 5,500 lines of conversion just to be a nice guy or make a point).


The examples were just meant to provide insight into the complexity of smaller UK IP ranges. All that I provided are within the 109 Class A, and there were more.

Syntax errors (and the required rechecking) are synonymous with regex.
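
Since that rechecking is tedious by hand, a small script can at least catch the syntax errors before the patterns go anywhere near the server config. A rough sketch, assuming the patterns live one per line in a file (the filename here is hypothetical):

```python
# Compile every pattern in a block list and report the ones with regex syntax
# errors. One pattern per line; '#' lines are treated as comments.
import re
import sys

def check_patterns(path: str) -> int:
    """Return the number of patterns that failed to compile."""
    failures = 0
    with open(path) as handle:
        for lineno, line in enumerate(handle, 1):
            pattern = line.strip()
            if not pattern or pattern.startswith("#"):
                continue
            try:
                re.compile(pattern)
            except re.error as exc:
                failures += 1
                print(f"line {lineno}: {pattern!r} -> {exc}")
    return failures

if __name__ == "__main__":
    sys.exit(1 if check_patterns("uk-ip-ranges.txt") else 0)
```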

tangor - msg:4370757 - 9:15 pm on Oct 4, 2011 (gmt 0)

those that want to get to certain (limited) data so that they don't resort to scraping.

In a perfect world there would be no scraping. In this world, anything that can be displayed can be saved (scraped), and not all the bad actors are bots. You'll also want to disallow all site scraper programs like xenu, htttpweb, etc...

Password-protect (best) or captcha (not nearly as good) the folders/info to be protected, and NEVER NEVER NEVER link from those protected pages to your public side.

Your users can undo all your protection schemes either innocently or deliberately.

If this material is that important, don't share it on the web without passwords UNIQUE PER VISITOR and retained via IP, etc (and other markers).
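
As a rough illustration of "passwords UNIQUE PER VISITOR and retained via IP", here is a minimal sketch; the in-memory store, function names and bind-on-first-use behaviour are assumptions, not tangor's recipe:

```python
# Sketch of per-visitor credentials bound to the IP they are first used from.
import secrets

tokens = {}   # token -> IP it is bound to (None until first use)

def issue_token() -> str:
    """Create one unique credential for one invited visitor."""
    token = secrets.token_urlsafe(16)
    tokens[token] = None
    return token

def authorise(token: str, ip: str) -> bool:
    """Accept the token only from the IP it was first presented from."""
    if token not in tokens:
        return False
    if tokens[token] is None:
        tokens[token] = ip           # bind to the first IP that uses it
    return tokens[token] == ip
```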

wilderness - msg:4370768 - 9:57 pm on Oct 4, 2011 (gmt 0)

If I do all of the above, is there any point in doing more?


Continued monitoring of your raw visitor logs is an essential part of being a webmaster.
ALL webmasters should certainly spend some time there, so that they become accustomed to the simple things that JUMP out at the eye.

IPs change often enough that IP ranges (assigned or otherwise) are not permanent. Companies, servers, providers and even websites simply go out of business or are purchased by other businesses, which may not have a need for the same IP range.
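
A tiny example of the kind of raw-log check that makes heavy hitters jump out, sketched under the assumption of a common/combined-format access log (the filename is hypothetical):

```python
# Count requests per client IP in an access log so the busiest sources stand out.
from collections import Counter

def top_ips(log_path: str, n: int = 20):
    counts = Counter()
    with open(log_path) as handle:
        for line in handle:
            if line.strip():
                counts[line.split(" ", 1)[0]] += 1   # first field = client IP
    return counts.most_common(n)

if __name__ == "__main__":
    for ip, hits in top_ips("access.log"):
        print(f"{hits:6d}  {ip}")
```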

Pfui - msg:4370881 - 3:01 am on Oct 5, 2011 (gmt 0)

@inbound: The title of this thread specifies spiders and bots but it looks like you're equally, if not more, concerned about what real people do with your content. Regardless, there is no perfect solution, even if it's comprised of layers and hurdles and traps and what-have-yous, even if you throw an eye-popping amount of money at programmers to engineer solutions.

Why? Because of this old news:

Spiders, scrapers, and hackers happen. Ditto real people innocently, or intentionally, doing things you dislike, didn't intend, and/or didn't see coming.

Thing is, we've spitballed lots of pretty detailed how-tos in this thread yet keep falling frustratingly short of your (albeit vague) specifications. So maybe it's butt-in-chair time for you, time to dig in and start making what will, sooner or later, be the best solution for your stuff, your server, your life?

(If not, heck, go hire jdMorgan:)

Seb7 - msg:4370973 - 10:35 am on Oct 5, 2011 (gmt 0)

I have some experience with this, as on my previous project I wanted pages and other code hidden. The answer is that you can't block 100% completely, but you can get close.

One method I've used before is to check for mouse movement, and Ajax back to the server if it moved, along with other indicators. Set a session variable for real users, and block pages that don't have it set.
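
A bare-bones server-side sketch of that idea follows; the function names and the in-memory session store are assumptions, and as dstiles notes below, visitors with JavaScript disabled would never get flagged:

```python
# Server-side half of the mouse-movement check (sketch; names are assumptions).
# The client-side part would be a one-shot mousemove listener that POSTs to the
# beacon endpoint, after which the session is treated as human.
human_sessions = set()   # session IDs that have reported mouse movement

def handle_mouse_beacon(session_id: str) -> None:
    """Called by the AJAX endpoint the mousemove handler posts to."""
    human_sessions.add(session_id)

def serve_protected_page(session_id: str, render_page):
    """Only sessions already flagged as human get the real content."""
    if session_id not in human_sessions:
        return 403, "Forbidden"      # or a page that still includes the beacon JS
    return 200, render_page()
```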

dstiles - msg:4371140 - 4:49 pm on Oct 5, 2011 (gmt 0)

Wilderness - sorry, I should have read your posting more carefully. Thanks. :)

Seb7 - see my previous posting re: javascript and ajax. A LOT of people do not use them, so you won't get mouse movements and you will lose visitors.

wilderness - msg:4371170 - 5:44 pm on Oct 5, 2011 (gmt 0)

Seb7 - see my previous posting re: javascript and ajax. A LOT of people do not use them, so you won't get mouse movements and you will lose visitors.


Affirmative.

I've encouraged most everybody I know to navigate the majority of websites with java OFF.

lucy24 - msg:4371233 - 8:30 pm on Oct 5, 2011 (gmt 0)

Does "mouse movement" mean "cursor movement"? What's being measured? You don't want to lock out humans with physical disabilities. Or, at least, you don't want to lock them out on the grounds that you think they're not human. Sites that simply "don't work" are an annoyance but it's something you learn to live with.
