
Search Engine Spider and User Agent Identification Forum

Stopping scrapers from the get-go
wheel




msg:4267706
 1:23 am on Feb 16, 2011 (gmt 0)

I'm putting a *huge* number of pages of content online. I'm looking to stop the scraping/copying/bots from the outset and I need bandwidth kept to a minimum. I've never done this before, so I'm not quite sure where to start.

Most of the content is on static HTML pages. My preliminary reading suggests that may be problematic (since I'm not generating the pages programmatically).

Can anyone suggest details on what I should be doing? Here are the areas I'm thinking of:
1) In .htaccess, block a list of IPs from Spamhaus (rough sketch of 1-3 after the list).
2) In .htaccess, block a large list of IPs from other countries?
3) In .htaccess, block a long list of user agents (using the code from WebmasterWorld)?
4) Whitelist Google, Yahoo and MSN in robots.txt.
5) Block Google and the other bots from crawling my images. I think this will block all robots from crawling GIFs at any level of my site?
User-agent: Googlebot
Disallow: /*.gif$

6) Then I think I'd like to block IPs from hosting companies. Is there an easy-to-use list of those IPs?
7) After that I think I should do some IP blocking dynamically, i.e. trigger a block if someone is crawling too many pages too fast. But since I'm serving static HTML, how do I do that? Set up a cron job to run a script every minute that reads the log and takes action? That seems complex and burdensome.
8) Since the content is static, Google and the rest don't need to download the HTML 8 times a month; once a year is fine. What's the best way to tell the bots that a page hasn't changed, so there's no need to crawl? ETags? I think that stuff requires changing the page headers, and that's tough to do with static HTML pages.
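For 1) through 3), here's roughly the kind of .htaccess block I have in mind -- a sketch only, with placeholder ranges and agents rather than a vetted list:

# placeholder IP ranges -- swap in the real Spamhaus/country lists
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24

# placeholder user agents -- swap in the WebmasterWorld list
SetEnvIfNoCase User-Agent "libwww-perl" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
Deny from env=bad_bot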

Anything else I missed?

 

encyclo




msg:4267723
 2:31 am on Feb 16, 2011 (gmt 0)

What's the best way to tell the bots that a page hasn't changed, thus no need to crawl? etags? I think that stuff requires I change the page headers, and that's tough to do with static html pages.


I'll leave others to reply on the other items, but if you are dealing with static content, Apache should already be set up to handle this situation automatically (assigning ETags, sending 304 Not Modified responses, etc.). You can also use .htaccess directives to define more specific expiry times if needed.

  • Caching Tutorial for Web Authors and Webmasters [mnot.net]
  • Apache Module mod_expires [httpd.apache.org]
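For reference, a minimal mod_expires sketch for .htaccess (assuming the module is loaded; adjust the types and lifetimes to suit):

<IfModule mod_expires.c>
ExpiresActive On
# static pages rarely change, so let caches hold them for a while
ExpiresByType text/html "access plus 1 month"
ExpiresByType image/gif "access plus 1 year"
</IfModule>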
    wilderness




    msg:4267736
     3:13 am on Feb 16, 2011 (gmt 0)

    4) White list Google, Yahoo and MSN in robots.txt


    In your use of "white list" above, is it your intent to allow these SE's or deny these SE's?

    By definition and/or practice, "white list" or "white-list" denies ALL visitors, and allows only exceptions.

    Many webmasters here use a "white-list" method in .htaccess (an entirely different process from robots.txt); unfortunately there are only a few examples in the WebmasterWorld archives, and those offer minimal reference. "White-listing" requires a thorough comprehension of browsers, UAs, IPs and even other criteria, all used in a collective application to deny all and allow exceptions.
    There are no comprehensive examples of this process on the open WWW.
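    As a bare-bones illustration of the deny-all/allow-exceptions idea in Apache 2.2 .htaccess (the ranges below are placeholders only; a real white-list layers UA and other criteria on top, typically with mod_setenvif or mod_rewrite):

    # deny everything by default, allow only the exceptions you trust
    Order Deny,Allow
    Deny from all
    # placeholder ranges -- substitute the SE and visitor ranges you actually mean to allow
    Allow from 192.0.2.0/24
    Allow from 198.51.100.0/24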

    wilderness




    msg:4267742
     3:20 am on Feb 16, 2011 (gmt 0)

    5) block google and the other bots from crawling my images. I think this will block all robots from crawling gif's at any level of my site?
    User-agent: Googlebot
    Disallow: /*.gif$


    There's a more effective way to accomplish this!
    Place all your images in specific, dedicated image directories, and disallow those directories in robots.txt.
    The compliant SEs will abide.
    Any lack of compliance is easily noticeable, and adjustments and notes can then be made for both the IP range and the UA.

    A consistent website structure, and consistent methods within that structure, will let you recognize suspicious visitors who are traveling your site in a way that ignores the built structure (i.e., crawling).
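    For example, with the images gathered under a couple of dedicated directories (the names here are placeholders), the robots.txt entries would be along these lines:

    User-agent: *
    Disallow: /images/
    Disallow: /photos/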

    wilderness




    msg:4267743
     3:22 am on Feb 16, 2011 (gmt 0)

    Anything else I missed?


    Generally speaking, uppercase or mixed-case directory and file names are a bad practice; however, a few key directories and/or files with mixed-case names will help you identify many weak bots.

    SevenCubed




    msg:4267755
     3:52 am on Feb 16, 2011 (gmt 0)

    From my recent experiences I can offer some suggestions on point 1 and 2.

    The Spamhaus blacklist won't be very effective against scrapers. It is primarily intended for use on your mail server to block incoming spam -- and for that it is very effective. I use it and like it; spam only trickles in.

    I had been stuffing my Linux iptables at the kernel level with bunches of nasty IP ranges, as you want to do in .htaccess. It finally built up to the point where it began causing serious performance issues. Those same performance issues would be compounded even more through .htaccess, because the file is opened and read on every request.

    It resulted in Google slowing down its crawl frequency, which compounded into sites hosted on the server losing some positions in the SERPs. It was a tough decision and trade-off, but I purged the IPs and left everything wide open again. Server performance skyrocketed, and exactly 60 days to the day after opening everything back up, Google's crawl rate and frequency went back up and lifted the sites back to where they were previously. In fact one of them went from spot #10 to #2. Coincidence? Up to you to decide, but I say slow-loading pages had become a ranking factor.

    All that said, I can send you a very effective and comprehensive list of IPs that you can use in your .htaccess. It will be effective in keeping away scrapers, but I know it will also slow down your site.

    If you want it, PM me, but I won't be able to send it until sometime tomorrow because I don't have it handy right now.

    It might not be as bad depending on how much RAM and CPU you have available. I operate at minimal because I'm just starting out on my own and will scale up to more resources as it becomes necessary.

    topr8




    msg:4267812
     8:36 am on Feb 16, 2011 (gmt 0)

    6) No, but this forum is full of pointers as to webhost blocks
    as dstiles said in another thread:
    >>I'm still picking up one or two server ranges a day but usually these are either very small ranges ...
    [webmasterworld.com...]

    7) Isn't it possible to prepend a PHP file using httpd.conf or .htaccess? You could prepend something based on the famous Alex_K script, which is in the PHP forum library (see the sketch below).
    (I must admit I've only prepended to files being parsed as PHP before, but I'm pretty sure you can prepend to static HTML too -- if you can't, I'm sure someone will chime in, as I wouldn't mind confirmation myself.)

    8) use Apache mod_expires
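    For 7), the prepend idea is roughly this -- a sketch only; it assumes mod_php, that the host allows php_value in .htaccess, and that .html is handed to the PHP handler (the file path is a placeholder):

    # hand .html to PHP, then prepend a bot-check script to every page
    AddType application/x-httpd-php .html
    php_value auto_prepend_file /home/example/bot-check.php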

    So... your mini-Gutenberg project is about to go live then?

    TheMadScientist




    msg:4267916
     11:43 am on Feb 16, 2011 (gmt 0)

    1) in htaccess, block a list of IP's from spamhaus
    Cool

    2) in htaccess block a large list of IP's from other countries?
    Possibly... Definitely personal discretion IMO on this one.

    3) in htaccess, block a lot of user agents (get the code from WebmasterWorld)?
    Definitely

    4) White list Google, Yahoo and MSN in robots.txt
    Sure, but don't rely on robots for anything.
    A real scraper probably won't even visit it, let alone listen.

    5) block google and the other bots from crawling my images. I think this will block all robots from crawling gif's at any level of my site?
    User-agent: Googlebot
    Disallow: /*.gif$

    User-agent: *
    Disallow: /*.gif$

    I wouldn't do that, though... If I really didn't want them crawled, I'd block requests that don't send a referrer, in .htaccess.

    # requires mod_rewrite; RewriteEngine On is needed once per .htaccess
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} !.+
    RewriteRule \.gif$ - [F]

    6) Then I think I'd like to block IP's from hosting companies. Is there an easy to use list of those IP's?
    Not sure on this one.

    7) after that I should do some IP blocking dynamically I think. Like trigger a block if someone is crawling too many pages too fast. But since I'm serving static html, how do I do that? Set up a cron job to run a script every minute that reads the log and takes action? This seems complex and burdensome.

    ADDED: You can usually pre-pend php to your html pages in the .htaccess so you don't need to convert all your pages to php or parse your html as php.

    The Apache Forum [webmasterworld.com] here is a great place for help with the .htaccess stuff.

    There's also a php bot blocking script or two posted here somewhere that's great. I've used it (them?) on a couple of sites. (Been a long time, so I don't remember the specifics, but you should be able to find the info you need searching around a bit.)

    There are actually a couple, I think: one to time out rapid requests and block the IP, and another 'honey-pot' script that blocks the IP if the bot visits a page disallowed in robots.txt.

    8) Since the content is static, Google and the rest don't need to download the html 8 times a month. Once a year is fine. What's the best way to tell the bots that a page hasn't changed, thus no need to crawl? etags? I think that stuff requires I change the page headers, and that's tough to do with static html pages.

    ETags are usually set and sent by the server, but if you pre-pend PHP to your pages you can send all your headers there if necessary, including the following: set an Expires and a Cache-Control header in .htaccess with a date far in the future, and personally I would set a low priority on an XML sitemap for the pages too.
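    If you do go the prepend route, the header part is just a few lines of PHP at the top of the prepended file -- a sketch, with a one-year lifetime as an example value:

    <?php
    // must run before any page output, so keep it at the very top of the prepended file
    $lifetime = 31536000; // one year, in seconds
    header('Cache-Control: public, max-age=' . $lifetime);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $lifetime) . ' GMT');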

    SevenCubed




    msg:4268050
     5:40 pm on Feb 16, 2011 (gmt 0)

    Wheel, I've sent the list. A practical way in which it may be effective is to apply it until your pages get indexed and get their due recognition as the original source of the info. Then, if performance is an issue, you can trade off some security for performance. That way your site gets a fighting chance.

    ken_b




    msg:4268061
     5:58 pm on Feb 16, 2011 (gmt 0)

    If you block crawling/indexing/whatever of your images, you need to think about Google's "preview" function.

    Block G from images and your pages are going to look abandoned in the preview view. Been there, done that, bad news. Logos, product images, info images -- you name the image or image type -- how will your page look without them?

    Would you click through to a page where all the images were obviously missing? Bye, bye human traffic...!

    At the very least, consider putting a somewhat smaller and/or lower-quality image on the page, and make that image clickable, leading to a better/larger image in a restricted directory.

    dstiles




    msg:4268154
     8:49 pm on Feb 16, 2011 (gmt 0)

    As SevenCubed said, Spamhaus is not a good idea. Mostly it lists dynamic broadband IPs - the very traffic you want (in the UK, BT and NTL; in the US, RoadRunner and Charter, etc.).

    Possibly use the Spamhaus DBL and DROP lists - these cover domains and IP ranges used for hosting exploits - but the lookup delay is probably not worth it, especially as most of the exploited domains run from a relatively small range of IPs which can easily be blocked anyway. Also, in my experience, exploited servers do not seem to cross over between mail and botnet.

    I block countries according to the requirements of various web sites. Here in the UK I have several sites that do not need traffic from Asia, South America and the Russia/Ukraine/Romania block. On the other hand, I also have clients who want traffic from such regions so blocking by country for me has to be selective.

    My own blocklist is run through MySQL and, as mentioned above, I add a few new ranges every day - now generally small ranges. I would have preferred to use .htaccess, as that prevents bad bots from seeing the page at all, but on my Windows 2003 server I'm stuck with blocking content, with a few really bad exceptions that go into the IIS blocklist.

    The total MySQL database size is currently about 20,000 records (collected over about 12 months), but about half of those are single dynamically blocked IPs which are overdue for removal. The typical introduced delay for the first page is 15-30 milliseconds; if sessions are enabled in the browser, subsequent hits are faster.

    If you have time/experience it may pay to see if MySQL can feed htaccess - I do not have significant experience of htaccess so I don't know if it's possible, but I would have thought it possible to use a dynamic htaccess based on the IP and UA.
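    I haven't tried it, but as a rough sketch the idea would be a small script that rebuilds a deny file from the blocklist table on a schedule (the table, column and file names below are made up); on Apache that file could be pulled in by a server-level Include, or the script could rewrite .htaccess itself:

    <?php
    // hypothetical sketch: dump a MySQL blocklist table into Apache deny directives
    $db = new mysqli('localhost', 'dbuser', 'dbpass', 'security');
    $out = "Order Allow,Deny\nAllow from all\n";
    $result = $db->query('SELECT ip_range FROM blocklist');
    while ($row = $result->fetch_assoc()) {
        $out .= 'Deny from ' . $row['ip_range'] . "\n";
    }
    file_put_contents('/path/to/deny.conf', $out, LOCK_EX);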

    Megaclinium




    msg:4268251
     1:47 am on Feb 17, 2011 (gmt 0)

    I have my images set up to avoid scraping. In the control panel, you simply list the extensions that should only be available from your own web pages (hotlink protection). I have .jpg set this way.

    So if YOUR website requests the image (via a user requesting your web page), the image is transmitted.

    If an image harvester tries to fetch your .jpg directly, NOT from one of your web pages, the image won't transmit (the harvester bot gets a 302 back).

    It's simple and effective.
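    For anyone without such a control panel option, the usual .htaccess equivalent is an anti-hotlinking block -- a sketch, with example.com standing in for your own domain:

    RewriteEngine On
    # allow empty referrers (some browsers/proxies strip them) and your own pages; block the rest
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    RewriteRule \.(jpe?g|gif|png)$ - [F]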

    Of course I also do what they are suggesting above, ban hosting ranges. I only do this when they start to hit my site.

    My program for analyzing the raw logs sorts the hits and filters out the individual IPs or ranges I've already classified as bots into a separate file.

    The result is that any NEW hits go to a separate file and I see them immediately. Because of this I even see widely separated IPs that are close together in time and, judging from the UA, apparently the same person trying from both their banned range and a non-banned range (they had scraped me from the banned range and I 403'd them).

    dstiles




    msg:4268621
     7:53 pm on Feb 17, 2011 (gmt 0)

    Not everyone has this kind of feature in a Control Panel. Most of us have raw servers to deal with. :)

    JohnRoy




    msg:4268753
     12:03 am on Feb 18, 2011 (gmt 0)

    > Anything else I missed?

    incrediBILL would most probably have some thoughts to add on this issue.

    wheel




    msg:4268764
     12:47 am on Feb 18, 2011 (gmt 0)

    I've got a large list of IPs someone was kind enough to give me. I believe they're scrapers and rough areas of the world. I've implemented them in .htaccess.
    I've also implemented a banned list of UAs I found online that, funnily enough, was developed from some of the threads here on WebmasterWorld.
    I've whitelisted the three main SEs and banned everyone else.
    I've thrown a honeypot URL into robots.txt as well. For now I'll just watch my logs and ban any IP that accesses that URL. I may eventually drop a script into the directory as the index that either emails me every access or auto-adds the IP to the .htaccess file.
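    The honeypot index script I have in mind would only be a few lines of PHP -- a rough sketch, with the blocklist path and notification address as placeholders:

    <?php
    // log the hit, notify me, and 403 the visitor; a separate job merges the list into .htaccess
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    file_put_contents('/home/example/blocklist.txt', "Deny from $ip\n", FILE_APPEND | LOCK_EX);
    mail('me@example.com', 'Honeypot hit', $ip . "\n" . $ua);
    header('HTTP/1.1 403 Forbidden');
    exit;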

    I'm going to build the content behind the scenes, then put it online in one big dump. Then I'm going to immediately start building links to it. Hopefully that's a good start for getting this online and running. I'll add the header info later, as that's going to take some programming.

    JohnRoy




    msg:4268819
     4:22 am on Feb 18, 2011 (gmt 0)

    > content is on static html pages

    If I may ask: why can't this be transferred to some CMS that would make changing headers and footers much, much easier?

    wheel




    msg:4268979
     11:50 am on Feb 18, 2011 (gmt 0)

    I ended up with static HTML just because of the way the content was generated. It's way too much work/effort/money to put this into a cms right now.

    blend27




    msg:4269020
     1:53 pm on Feb 18, 2011 (gmt 0)

    Lately I've been looking into a CMS approach that would serve the content to whitelisted bots as usual and serve the rest of the site's visitors the content via jQuery/Ajax. I can't mention any names here, but there is a forum software/platform that does this.

    One of the geek communities I hang out with had an NNTP forum for 10 years before they moved to that platform; NNTP often crashed due to leeches scraping it, so the owner moved. It would/could be seen as a bit "cloak-ish", but at this point the forum ranks for what it needs to in G, M and Y, even Ask. When accessed by a real browser, the content is all jQuery and Ajax -- i.e., no JS = no content -- and bots, at least the current scraping ones, can't parse that.

    jdMorgan




    msg:4269082
     4:32 pm on Feb 18, 2011 (gmt 0)

    One thing you can do to limit the performance impact of long and complex access-control code sections is to be very selective about how and when you invoke them.

    For an example specific to the anti-scraper access control being discussed here, consider this: It is not necessary to control access to images, CSS files, JavaScript files, or other objects included on the page, it is only necessary to control access to the page itself.

    This is because if the scraper cannot fetch the page, then it cannot know the URLs of the objects included on that page, and so cannot fetch them either.

    So, to expand on what wilderness stated above, you can put your 'pages' into one subdirectory, put your images and other included objects into one or more additional subdirectories, and then put the really heavy access-control code into the 'page' subdirectory only. On the other hand, you could put your anti-hotlinking code only into these other included-object directories.

    Be aware that by using mod_rewrite or ISAPI Rewrite, you can avoid having to include these new subdirectory paths in your URLs; just rewrite the URL example.com/logo.gif to the filepath /images/logo.gif without changing the URL at all.
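    A minimal sketch of that internal rewrite, assuming an .htaccess file at the site root and /images/ as the included-object directory:

    RewriteEngine On
    # serve /logo.gif from the /images/logo.gif filepath without changing the public URL
    RewriteRule ^([^/]+\.(gif|jpe?g|png))$ /images/$1 [L]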

    Because .htaccess files are "per-directory" configuration files, you can use subdirectories to control when and how access-control code is executed. This can mitigate the server performance issues of large access-control lists.

    An aside: I tend to emphasize .htaccess file techniques simply because the majority of Web sites are hosted on name-based virtual servers, where server-level config files are not accessible to the Webmaster. For those on dedicated or virtual private servers, the technique described above can be applied even more efficiently by putting the access-control code into <Directory> sections in your httpd.conf or other server-level configuration file(s) instead of into individual .htaccess files in the subdirectories themselves.

    Jim

    wheel




    msg:4271609
     5:40 pm on Feb 24, 2011 (gmt 0)

    Thanks for all your help everyone.

    A follow-up question. I'm trying to limit crawling to some extent due to the huge volume of pages. The host I'm using (a local company) has me on a plan limited to 100 GB/month. I bet I've got almost 15 GB of data online. So if all three search engines crawl the whole site twice a month, I'm almost out of bandwidth.

    As a result, I set the crawl-delay to 5. Now I'm wondering if that's going to prevent the site from really being indexed.

    On a new site, with many tens of thousands of pages of new content, should I use that crawl delay? Or should I remove it and let them go nuts?

    JohnRoy




    msg:4271983
     3:06 am on Feb 25, 2011 (gmt 0)

    While I don't know what I would've done in regards to the crawl delay setting -

    There are many hosting providers out there with unlimited space and unlimited bandwidth.

    Is there a specific reason to keep it local?

    wheel




    msg:4271997
     3:23 am on Feb 25, 2011 (gmt 0)

    Yes, I wanted to go with someone not related to my normal hosting environment. And I wanted someone local because they're local. And I went with this guy specifically because he's a member of a local user group I belong to and he's helped me out in the past. It's less than ideal for this project, but forcing some limitations is a good learning experience.

    SevenCubed




    msg:4272002
     3:34 am on Feb 25, 2011 (gmt 0)

    The best solution to save bandwidth is to apply compression to your outgoing files. It typically saves about 70% bandwidth.

    SevenCubed




    msg:4272005
     3:42 am on Feb 25, 2011 (gmt 0)

    Are you on a Linux or Windows host?

    SevenCubed




    msg:4272021
     4:05 am on Feb 25, 2011 (gmt 0)

    I'm about to shut down for the night, so in case you come back, here's a solution if you are on a Linux host. If you are hosted on a Windows box, sorry, I cannot help you.

    Add this to the end of your .htaccess file:

    <IfModule mod_deflate.c>
    <FilesMatch "\.(php|css|html|xml|txt|js)$">
    SetOutputFilter DEFLATE
    </FilesMatch>
    </IfModule>

    You'll need to ensure that your hosting provider has the mod_deflate module loaded. If not, compression will not work, but it will fail gracefully because the directives are wrapped in the <IfModule mod_deflate.c> check.

    If you know positively that module is loaded then you can minimize it to:

    <FilesMatch "\.(php|css|html|xml|txt|js)$">
    SetOutputFilter DEFLATE
    </FilesMatch>

    Modify the above example with whatever file extensions you are using, ordered from most commonly requested to least commonly requested. DO NOT add image files to the list; they are already compressed and it will make things worse.

    One last thing: the above example is for Apache 2. If you are hosted on an earlier version, the same can be done using mod_gzip, but I haven't tested that code specifically on Apache 1.3, so I cannot be sure it will work properly.

    youfoundjake




    msg:4272081
     6:19 am on Feb 25, 2011 (gmt 0)

    Look at maybe setting up a bot trap as well for those that don't obey robots.txt.

    tangor




    msg:4272083
     6:30 am on Feb 25, 2011 (gmt 0)

    Best way to stop the leeches is to not post content. :)

    Seriously, all of the above are good, tried and true methods of slowing the scrape, but none will STOP it. Scraping is a cost of doing business these days and it will only get worse.

    wiian




    msg:4272197
     10:39 am on Feb 25, 2011 (gmt 0)

    I would prefer merging 1, 2, 3, 6, 7 and 8 into a single step.

    rewrite *.html to a php file

    In the PHP file:

    if bad ip or bad agent
    block, exit;
    if too many requests
    record ip, ban/ block, exit;

    // else all clear
    set appropriate headers
    include the html file
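    A rough PHP rendering of that outline -- the rewrite target, list file, thresholds and helper values are all placeholder assumptions:

    <?php
    // gateway.php -- paired with something like: RewriteRule ^(.+\.html)$ /gateway.php?page=$1 [L]
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    $badIps    = is_file('/home/example/bad-ips.txt')
        ? file('/home/example/bad-ips.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : array();
    $badAgents = array('libwww-perl', 'HTTrack'); // placeholder agent strings

    // bad IP or bad agent: block and exit
    if (in_array($ip, $badIps)) { header('HTTP/1.1 403 Forbidden'); exit; }
    foreach ($badAgents as $bad) {
        if (stripos($ua, $bad) !== false) { header('HTTP/1.1 403 Forbidden'); exit; }
    }

    // too many requests: record the hit count per IP and block past a threshold
    // (very crude -- a real version would reset the counter periodically, e.g. via cron)
    $counter = sys_get_temp_dir() . '/hits_' . md5($ip);
    $hits = (int) @file_get_contents($counter) + 1;
    file_put_contents($counter, $hits, LOCK_EX);
    if ($hits > 120) { header('HTTP/1.1 403 Forbidden'); exit; }

    // all clear: set appropriate headers and serve the static HTML file
    header('Cache-Control: public, max-age=31536000');
    $page = isset($_GET['page']) ? basename($_GET['page']) : 'index.html'; // basename() blocks traversal
    readfile($_SERVER['DOCUMENT_ROOT'] . '/' . $page);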

    blend27




    msg:4272369
     4:20 pm on Feb 25, 2011 (gmt 0)

    This is the way I program it on the sites I work on these days:

    1. A basic .htaccess takes care of the malformed/known scraper UAs and the blacklisted and abusive bot UAs (access data gets recorded, mostly IPs that are candidates for the blacklisted IP ranges, to look at later).

    I am on IIS servers so I can't use the goodies that come with Apache.

    2. There is a query in memory that holds whitelisted bots' IP ranges (a few only), so I compare the IP; if it passes, I record it (all the good stuff).

    If not in those ranges:

    3. The IP gets checked against known colo/hosting and some manually blacklisted ranges. If it is in those ranges, the first page served is a small human check with a captcha-style question generated on the fly; the contents and style of the page are random.


    4.
    a. UA check again
    b. IP checked against a DB index from the banned table
    c. IP range check against whitelisted ranges (yes, I only check for whitelisted at this point); access from OTHER ranges gets recorded.

    Step 4 is wrapped in a speed trap that controls in/out content; if triggered, the IP gets banned for a specific time.

    If passed, content is served (including bot-trap links), but wait, there is more...

    5. I design my sites with lots of CSS and JS, and I use background images that are generated/served via a server-side script. Those images and files are referenced in the HTML, external CSS and external JS, so IF those files are not requested, the trap page (human check) gets served after 2 pages of HTML have been served -- until proven innocent, the visitor is guilty.

    I block first and ask questions later, and I'm not afraid to lose a visitor if something quacks like a duck.

    acemi




    msg:4272376
     4:30 pm on Feb 25, 2011 (gmt 0)

    Using ConfigServer Security & Firewall and mod_security really keeps a lot of unwanted visitors/bots out.
