
crawler


keyplyr

10:46 pm on Mar 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: crawler (crawler.feedback@gmail.com)
Protocol: HTTP/1.1
Robots.txt: ?
Host: AWS
18.233.0.0 - 18.233.255.255
18.233.0.0/16
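
For anyone who would rather refuse this range at the server than wait on robots.txt, a minimal sketch in Apache 2.4 syntax (directory or .htaccess context; whether to block an entire cloud /16 is of course a judgment call) would be:

# refuse the listed AWS range, allow everyone else (Apache 2.4)
<RequireAll>
    Require all granted
    Require not ip 18.233.0.0/16
</RequireAll>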

blend27

11:10 pm on Mar 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt: NO

lucy24

5:49 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wasn't I talking just the other day about how much an @gmail address in a UA string inspires confidence?

Why yes. Yes, I was.

:)

keyplyr

6:03 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think at one point registrars wouldn't let new domains go through with contact addresses at @hotmail.com, @gmail.com, @yahoo.com, etc.

Now that it's been deregulated, anything goes.

MitchNginx

10:39 am on Mar 27, 2018 (gmt 0)

5+ Year Member



I have seen this start popping up in my server logs recently.

It does ask for robots.txt, but a search through some of my other site logs reveals it is also requesting hidden links in a page (honeypots), so it is clearly ignoring what a normal crawler should do. I also found requests for a number of non-existent pages on some of my sites.

x.x.x.x - - - [26/Mar/2018:21:45:01 +0200] "GET /robots.txt HTTP/1.1" 302 154 "-" "crawler (crawler.feedback@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"
x.x.x.x - - - [26/Mar/2018:21:46:31 +0200] "GET / HTTP/1.1" 302 154 "-" "crawler (crawler.feedback@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"

keyplyr

10:45 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi MitchNginx and welcome to WebmasterWorld [webmasterworld.com]

MitchNginx

10:47 am on Mar 27, 2018 (gmt 0)

5+ Year Member



Hi keyplyr, many thanks for the warm welcome.

lucy24

8:16 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It occurs to me that the minimalist UA makes it kinda impossible to have a robots.txt Disallow, since any number of robots contain the name element “crawler”, and not all of them are malign. This is a problem because it is--surprise!--barely possible that this is in fact a compliant robot, based on its behavior yesterday and today. It requested assorted pages, including everything linked from the 403 page* (but not some extra stuff linked only from the / front page) along with various interior pages suggesting that it is also following leads from a certain directory. It did not, however, request pages from a roboted-out directory** that is also linked both from the front page and the 403 page.

Hmm. Faute de mieux, I tried adding a Disallow for the string “feedback” instead, after checking that no legitimate robot currently has this element in its UA string. We Shall See.
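
Presumably a record along these lines (a sketch only; the exact rule isn't quoted here, and the blanket Disallow is an assumption):

# assumes the robot does a liberal, case-insensitive match of its name against this token
User-agent: feedback
Disallow: /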

btw, I met it from a completely different AWS range, 54.87.235.abc (always the same abc).


* YMMV. My 403 page is made purely for humans, so there are lots of links.
** /boilerplate/

keyplyr

8:40 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It occurs to me that the minimalist UA makes it kinda impossible to have a robots.txt Disallow, since any number of robots contain the name element “crawler
The robots exclusion standard addresses specific language.

lucy24

8:49 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you wanted to exclude a robot whose full name was “Crawler”, how would you word the Disallow: line?

keyplyr

8:52 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




User-agent: Crawler
Disallow: /


So this is not RegEx. Disallowing "Crawler" does not disallow "Crawler1"
(if that's what you were concerned with).

However, as I keep mentioning, the robots.txt standard never did achieve 'standard' support across the web, so different actors interpret it differently.

dougwilson

1:42 pm on Mar 28, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



I was just playing with this yesterday - SetEnvIfNoCase User-Agent ^(crawler|poster)$ blocked
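
That directive only sets an environment variable, so for completeness here is a hedged sketch of wiring it up to actually refuse requests (Apache 2.4 syntax; note the anchored pattern matches a UA that is exactly "crawler" or "poster", not the longer string seen in the logs above):

SetEnvIfNoCase User-Agent ^(crawler|poster)$ blocked

# refuse any request that carries the blocked flag (Apache 2.4)
<RequireAll>
    Require all granted
    Require not env blocked
</RequireAll>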

lucy24

7:03 pm on Mar 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



SetEnvIfNoCase User-Agent
Sure, it's easy to physically block them. (Psst! That’s what the BrowserMatch and BrowserMatchNoCase locutions are for.) I don't need to, because I use header-based access controls so almost everything is blocked by default. Except for robots.txt, all Crawler requests I found in recent days' logs received a 403. I haven't bothered to check how many specific access-control rules they violated (or rather, failed to meet).
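
For reference, BrowserMatchNoCase is simply shorthand for SetEnvIfNoCase matched against the User-Agent header, so dougwilson's line could equally be written as something like:

# case-insensitive match on the User-Agent header; same effect as the SetEnvIfNoCase form
BrowserMatchNoCase ^(crawler|poster)$ blocked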

But the one thing better than a blocked request is a request that isn't made at all; that's why you always Disallow in robots.txt.

Disallowing "Crawler" does not disallow "Crawler1"
My understanding is that robots are supposed to interpret the User-agent line as broadly as possible: when in doubt, read it as “This means you” rather than “Oh, I had no idea they meant ‘Crawler’ when they said ‘crawler’”.

Later: I was halfway through typing this post when I had to go take a phone call. When I finally got off the phone, my email showed a reply to the mail I sent Crawler Contact Guy yesterday. So now we know that even if it's, ahem, @gmail, it is a valid address, and mail gets read.

MitchNginx

7:31 am on Mar 29, 2018 (gmt 0)

5+ Year Member



I took a slightly different approach with my Nginx and Apache blocker projects and scan for the user-agent string "crawler.feedback" instead; I will monitor whether they change it.

keyplyr

7:42 am on Mar 29, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is not the forum for server code discussion.

Further code examples/discussion should be done in the Apache Code Forum [webmasterworld.com]