
crawler


keyplyr

10:46 pm on Mar 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: crawler (crawler.feedback@gmail.com)
Protocol: HTTP/1.1
Robots.txt: ?
Host: AWS
18.233.0.0 - 18.233.255.255
18.233.0.0/16
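
For anyone who would rather refuse this range at the server than wait on robots.txt, a minimal sketch in Apache 2.4 syntax (directory or .htaccess context; whether to block an entire cloud /16 is of course a judgment call) would be:

# refuse the listed AWS range, allow everyone else (Apache 2.4)
<RequireAll>
    Require all granted
    Require not ip 18.233.0.0/16
</RequireAll>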

blend27

11:10 pm on Mar 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt: NO

lucy24

5:49 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wasn't I talking just the other day about how much an @gmail address in a UA string inspires confidence?

Why yes. Yes, I was.

:)

keyplyr

6:03 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think at one point registrars wouldn't let new domains go through with contact addresses at @hotmail.com, @gmail.com, @yahoo.com, etc.

Now that it's been deregulated, anything goes.

MitchNginx

10:39 am on Mar 27, 2018 (gmt 0)

5+ Year Member



I have seen this start popping up in my server logs recently.

It does ask for robots.txt, but a search through some of my other site logs reveals it is also requesting hidden links in a page (honeypots), so it is clearly ignoring what a normal crawler should do. I also found requests for a number of non-existent pages on some of my sites.

x.x.x.x - - - [26/Mar/2018:21:45:01 +0200] "GET /robots.txt HTTP/1.1" 302 154 "-" "crawler (crawler.feedback@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"
x.x.x.x - - - [26/Mar/2018:21:46:31 +0200] "GET / HTTP/1.1" 302 154 "-" "crawler (crawler.feedback@gmail.com)" "-"PORT:80 0.000 - . "GZIP:-"

keyplyr

10:45 am on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi MitchNginx and welcome to WebmasterWorld [webmasterworld.com]

MitchNginx

10:47 am on Mar 27, 2018 (gmt 0)

5+ Year Member



Hi keyplyr, many thanks for the warm welcome.

lucy24

8:16 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It occurs to me that the minimalist UA makes it kinda impossible to have a robots.txt Disallow, since any number of robots contain the name element “crawler”, and not all of them are malign. This is a problem because it is--surprise!--barely possible that this is in fact a compliant robot, based on its behavior yesterday and today. It requested assorted pages, including everything linked from the 403 page* (but not some extra stuff linked only from the / front page) along with various interior pages suggesting that it is also following leads from a certain directory. It did not, however, request pages from a roboted-out directory** that is also linked both from the front page and the 403 page.

Hmm. Faute de mieux, I tried adding a Disallow for the string “feedback” instead, after checking that no legitimate robot currently has this element in its UA string. We Shall See.
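
Presumably a record along these lines (a sketch only; the exact rule isn't quoted here, and the blanket Disallow is an assumption):

# assumes the robot does a liberal, case-insensitive match of its name against this token
User-agent: feedback
Disallow: /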

btw, I met it from a completely different AWS range, 54.87.235.abc (always the same abc).


* YMMV. My 403 page is made purely for humans, so there are lots of links.
** /boilerplate/

keyplyr

8:40 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It occurs to me that the minimalist UA makes it kinda impossible to have a robots.txt Disallow, since any number of robots contain the name element “crawler
The robots exclusion standard addresses specific language.

lucy24

8:49 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you wanted to exclude a robot whose full name was “Crawler”, how would you word the Disallow: line?

keyplyr

8:52 pm on Mar 27, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




User-agent: Crawler
Disallow: /


So this is not RegEx. Disallowing "Crawler" does not disallow "Crawler1"
(if that's what you were concerned with).

However, as I keep mentioning, the robots.txt standard never did achieve 'standard' support across the web, so different actors interpret it differently.

dougwilson

1:42 pm on Mar 28, 2018 (gmt 0)

10+ Year Member Top Contributors Of The Month



I was just playing with this yesterday - SetEnvIfNoCase User-Agent ^(crawler|poster)$ blocked
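
That directive only sets an environment variable, so for completeness here is a hedged sketch of wiring it up to actually refuse requests (Apache 2.4 syntax; note the anchored pattern matches a UA that is exactly "crawler" or "poster", not the longer string seen in the logs above):

SetEnvIfNoCase User-Agent ^(crawler|poster)$ blocked

# refuse any request that carries the blocked flag (Apache 2.4)
<RequireAll>
    Require all granted
    Require not env blocked
</RequireAll>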

lucy24

7:03 pm on Mar 28, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



SetEnvIfNoCase User-Agent
Sure, it's easy to physically block them. (Psst! That’s what the BrowserMatch and BrowserMatchNoCase locutions are for.) I don't need to, because I use header-based access controls so almost everything is blocked by default. Except for robots.txt, all Crawler requests I found in recent days' logs received a 403. I haven't bothered to check how many specific access-control rules they violated (or rather, failed to meet).
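
For reference, BrowserMatchNoCase is simply shorthand for SetEnvIfNoCase matched against the User-Agent header, so dougwilson's line could equally be written as something like:

# case-insensitive match on the User-Agent header; same effect as the SetEnvIfNoCase form
BrowserMatchNoCase ^(crawler|poster)$ blocked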

But the one thing better than a blocked request is a request that isn't made at all; that's why you always Disallow in robots.txt.

Disallowing "Crawler" does not disallow "Crawler1"
My understanding is that robots are supposed to interpret the User-agent line as broadly as possible: when in doubt, read it as “This means you” rather than “Oh, I had no idea they meant ‘Crawler’ when they said ‘crawler’”.

Later: I was halfway through typing this post when I had to go take a phone call. When I finally got off the phone, my email showed a reply to the mail I sent Crawler Contact Guy yesterday. So now we know that even if it's, ahem, @gmail, it is a valid address, and mail gets read.

MitchNginx

7:31 am on Mar 29, 2018 (gmt 0)

5+ Year Member



I took a slightly different approach with my Nginx and Apache blocker projects and scan for the user-agent string "crawler.feedback" instead; I will monitor whether they change it.

keyplyr

7:42 am on Mar 29, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is not the forum for server code discussion.

Further code examples/discussion should be done in the Apache Code Forum [webmasterworld.com]