
Search Engine Spider and User Agent Identification Forum

    
stremorbot
lucy24
 12:30 am on Apr 21, 2013 (gmt 0)

74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /robots.txt HTTP/1.1" 200 616 "-" "stremorbot AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"
74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /robots.txt HTTP/1.1" 200 616 "-" "stremorbot AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"

... and that was all she wrote.

Maybe I should phrase it as a generic question: Does the element "appengine" in the UA string ever point to something good and useful? Or is it simpler just to lock 'em all out and not think about it any further? Why is Google letting other people's robots crawl under its umbrella anyway?

 

wilderness
 1:08 am on Apr 21, 2013 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^74\.125\. [OR]
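
A standalone RewriteCond does nothing by itself; presumably it sits in a block along these lines (the completion is mine, not Don's):

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^74\.125\. [OR]
RewriteCond %{HTTP_USER_AGENT} AppEngine
RewriteRule .* - [F]

The [OR] chains the address test with whatever condition follows; the rule itself returns a 403.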

lucy24
 4:37 am on Apr 21, 2013 (gmt 0)

:)

I went with

BrowserMatch AppEngine keep_out

after a closer look revealed that the full package was-- don't know how I missed this the first time--

74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /robots.txt HTTP/1.1" 200 616 "-" "stremorbot AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"
74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /robots.txt HTTP/1.1" 200 616 "-" "stremorbot AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"
74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /hovercraft/hovercraft.html HTTP/1.1" 200 15785 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~liquid-helium)"
74.125.184.17 - - [20/Apr/2013:03:22:58 -0700] "GET /hovercraft/hovercraft.html HTTP/1.1" 200 15785 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"
74.125.184.17 - - [20/Apr/2013:03:22:59 -0700] "GET /hovercraft/hovercraft.html HTTP/1.1" 200 15785 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~stremor-crawler)"
74.125.184.17 - - [20/Apr/2013:03:22:59 -0700] "GET /hovercraft/hovercraft.html HTTP/1.1" 200 15785 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~liquid-helium)"


Liquid Helium to S-Tremor: "Hey, could you pick up a copy of robots.txt for me too as long as you're there?"
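
For anyone copying the BrowserMatch line above: it only sets the environment variable. A minimal sketch of the rest of the wiring, Apache 2.2 syntax (my reconstruction, not lucy24's exact file):

BrowserMatch AppEngine keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out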

Now, if only I knew what makes the /hovercraft/ directory so entrancing to robots... I've even got a minor botnet that consistently asks for its index page after picking up a 403 on some random other page (fake-referer block on interior pages). Been coming around for a month or two at least. If they start repeating themselves, I look up the IP and block appropriately.

Don, do you really do all your IP blocks in mod_rewrite?

wilderness
 5:10 am on Apr 21, 2013 (gmt 0)

do you really do all your IP blocks in mod_rewrite


Nay.

I converted many to CIDR and placed multiples on the same deny from lines, which reduced my htaccess size by approximately 25%.

Unfortunately (as explained previously), I'm unable to comprehend CIDR ranges at a glance (or even using all my fingers, toes, and six calculators), thus I'm reluctant to switch completely.
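
For anyone else who can't read CIDR at a glance, a decoded example (the second range is a documentation block, purely illustrative):

# /NN means the first NN bits are fixed; the rest is the range
# 74.125.0.0/16 covers 74.125.0.0 - 74.125.255.255 (65,536 addresses)
# 203.0.113.0/24 covers 203.0.113.0 - 203.0.113.255 (256 addresses)
deny from 74.125.0.0/16 203.0.113.0/24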

keyplyr
5:32 am on Apr 21, 2013 (gmt 0)

I filter UAs that contain "appid" as well as "crawler|cache|copier|etc...", allowing only a few IPs through. I don't block this Google IP per se.
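
A sketch of that kind of filter, assuming mod_setenvif; the patterns and the allowlisted range are illustrative, not keyplyr's actual list:

SetEnvIfNoCase User-Agent "appid" keep_out
SetEnvIfNoCase User-Agent "crawler|cache|copier" keep_out
# re-admit a trusted range (example range only)
SetEnvIf Remote_Addr "^66\.249\." !keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out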

Samizdata
 11:26 am on Apr 21, 2013 (gmt 0)

I don't block this Google IP per se

My approach too - as well as the translator and anonymous automated checks, they use it for the AdWords Keyword Tool and Google Sitemaps.

Not everyone will want to block the last two (though Don can probably live without them).

...

wilderness
 6:00 pm on Apr 21, 2013 (gmt 0)

I went with

BrowserMatch AppEngine keep_out

dstiles 2008 [webmasterworld.com]

I'd have to go back through my old files (BU DVDs), however I'm most positive that I'd denied 74.125 long before 2008.

dstiles
 7:40 pm on Apr 21, 2013 (gmt 0)

Can't recall that posting but it must have been me. :)

I've since blocked the 74.125.0.0/16 range completely.

moxie
 9:19 pm on Apr 21, 2013 (gmt 0)

I've since blocked the 74.125.0.0/16 range completely.

I've been wondering about blocking that exact range for a very long time now, however I was always worried about doing so. Am I correct in assuming that it's a bad idea for ecom sites, with respect to G rankings?

lucy24
 9:59 pm on Apr 21, 2013 (gmt 0)

:: detour to double check ::

Wow, longer list than I thought. In addition to the faviconbot, I've got 74.125 listed as
Preview; Wireless Transcoder; urlresolver; Rich Snippets

No relation to the recently arrived snippetbot, which I'd briefly flagged as "no skin off my nose" but did a 180 when I found it snipping at my test site.

The faviconbot benefits from a special dispensation: requests for the favicon are exempted from the ordinary mod_auth blocks (including null UA). It's a, uhm, corollary beneficiary; the rule is intended to flag humans who got locked out by mistake.
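
Presumably something along these lines; a reconstruction of the pattern, not lucy24's actual rules (Apache 2.2, mod_setenvif):

SetEnvIf User-Agent "^$" keep_out
# favicon requests pass even when the UA is otherwise blocked,
# so a human locked out by mistake still shows up in the logs
SetEnvIf Request_URI "^/favicon\.ico$" !keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out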

keyplyr
1:41 am on Apr 22, 2013 (gmt 0)

Just an FYI - about a year ago, when the Google Preview was being discussed, I tested blocking the Google IP range that I had determined to be Preview, etc. The site I tested on dropped in human page loads by 20% to 30%. I removed the block and the site returned to normal traffic shortly thereafter. Conclusion: IMO users like the fact that they can see that preview, or maybe they think something is wrong if they don't.

lucy24
 2:48 am on Apr 22, 2013 (gmt 0)

That's my tentative impression of Preview too: It may not help you, but its absence can hurt you. "How come there's no preview? What's wrong with the site? Does it have dirty pictures or something? Did they go off the air?"

:: wandering off to post question in forum whose members have developed the hobby of willfully misunderstanding everything I say, but may still yield information ::

wilderness
 3:22 am on Apr 22, 2013 (gmt 0)

FWIW, I don't use GWMT, nor Google ads.
The only Google IPs allowed are 66.249.64-95, and ONLY from their bots. No previews. (Nor are these restrictions set upon Google alone; it's the same for Yahoo and Bing.)
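
A sketch of that allow-only-the-crawl-range idea in mod_rewrite terms; my reconstruction, not wilderness's actual rules (66.249.64-95 is 66.249.64.0/19):

RewriteEngine On
# the permitted range must present a known bot UA
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
RewriteRule .* - [F]
# other search-engine ranges are denied outright, e.g.:
RewriteCond %{REMOTE_ADDR} ^74\.125\.
RewriteRule .* - [F]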

Should some widget user land upon content of my pages in a SERP (which likely cannot be found any other place on the WWW), and choose not to view it because a preview is unavailable?

Too bad, so sad.

Content, content, and more content. . . .

jlnaman
 5:49 pm on Apr 22, 2013 (gmt 0)

Lucy24's examples were all for robots.txt. For better or worse, I have been returning a very short response (28 bytes):
<?php
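// serve a universal "Disallow: /" as plain text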
header("Content-Type: text/plain");
$currentsyntax='User-agent: *
Disallow: /

';
echo $currentsyntax;
?>

# note the extra <cr> before ';
It is also very useful (to me) for bizarre Bing requests.
The bottom line is that the number of attempts at non-robots.txt files has decreased.

jlnaman
 5:51 pm on Apr 22, 2013 (gmt 0)

Woops, I missed the hovercraft. But my KEEPOUT goes to the custom robots.php and not simple 403s; 403s are ignored.
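
A guess at the wiring for anyone following along (not jlnaman's actual rules; the UA pattern is just the one from this thread):

RewriteEngine On
# send flagged bots to the deny-all robots.php instead of a 403
RewriteCond %{HTTP_USER_AGENT} AppEngine [NC]
RewriteCond %{REQUEST_URI} !^/robots\.php$
RewriteRule .* /robots.php [L]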
