IndeedBot

         

TorontoBoy

12:30 pm on Oct 23, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 (IndeedBot 1.1)
Protocol: HTTP/1.1
Robots.txt: No
IP: 198.58.75.* 199.119.215.*
Host: 198.58.72.0 - 198.58.79.255 199.119.212.0 - 199.119.215.255
CIDR: 198.58.72.0/21 199.119.212.0/22
Organization: CyrusOne LLC
Notes: general scraper bot
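
A minimal sketch, assuming you want to test whether an address from your own logs falls inside the two CIDR blocks listed above. It uses only Python's standard `ipaddress` module; the function name and the sample addresses are illustrative and not taken from any log in this thread.

```python
import ipaddress

# CIDR blocks copied from the listing above.
INDEEDBOT_NETWORKS = [
    ipaddress.ip_network("198.58.72.0/21"),
    ipaddress.ip_network("199.119.212.0/22"),
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if ip falls inside one of the listed CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INDEEDBOT_NETWORKS)


# Sample addresses are made up for illustration.
print(in_listed_ranges("198.58.75.10"))  # inside 198.58.72.0/21
print(in_listed_ranges("198.58.80.1"))   # just past the end of that block
```

The same networks also reproduce the Host line: the first and last address of each block are available as `net[0]` and `net[-1]`.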

jonasjacek

3:11 pm on Oct 24, 2017 (gmt 0)

5+ Year Member



For both CIDRs I get this notice:

$ whois 198.58.72.0/21

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/public/whoisinaccuracy/index.xhtml
#
No match found for n + 198.58.72.0/21.


There seems to be an inaccuracy in their records.

MitchNginx

6:54 am on Apr 13, 2018 (gmt 0)

5+ Year Member



I can confirm this from my logs:
- No robots.txt requested
- Bot digging all links in a page including hidden honeypots

lucy24

7:58 pm on Jul 22, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IP: 107.182.234.abc
UA: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 (IndeedBot 1.1)
Requests: html with all associated css and js (including piwik.js, requested over and over again** on each page), giving page as referer for css/js
robots.txt: yes and no*

* “yes”: yes, it requested robots.txt
“no”: but only as request no. 891 of 1002 (plus four on neighboring site associated with attempt to opt-out of piwik logging), long after requesting files in roboted-out directory

** robots aren't allowed to get piwik files, so I don't know whether it would still have requested the file over and over again if it had not kept getting blocked. I suspect no, because non-blocked scripts and stylesheets were only requested once, no matter how many pages use them.
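
A rough sketch of how the "yes and no" finding could be checked, assuming the visit has already been reduced to an ordered list of requested paths. The function names, the sample paths, and the disallowed prefix are all hypothetical; real log parsing is left out.

```python
from typing import Iterable, List, Optional


def first_robots_request(paths: Iterable[str]) -> Optional[int]:
    """Return the 1-based position of the first robots.txt request, or None."""
    for position, path in enumerate(paths, start=1):
        # Ignore any query string when comparing the path.
        if path.split("?", 1)[0] == "/robots.txt":
            return position
    return None


def disallowed_before_robots(paths: Iterable[str],
                             disallowed_prefixes: Iterable[str]) -> List[str]:
    """List disallowed paths fetched before robots.txt was first requested."""
    paths = list(paths)
    cutoff = first_robots_request(paths)
    # If robots.txt was never requested, every request counts as "before".
    early = paths if cutoff is None else paths[: cutoff - 1]
    prefixes = tuple(disallowed_prefixes)
    return [p for p in early if p.startswith(prefixes)]


# Hypothetical visit: robots.txt arrives only after a disallowed page.
visit = ["/", "/private/a.html", "/robots.txt", "/b.html"]
print(first_robots_request(visit))
print(disallowed_before_robots(visit, ["/private/"]))
```

A late position combined with a non-empty second list is the "requested, but only after the damage was done" pattern described above.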

Is it now blocked? Yes, indeed.

User-Agent: IndeedBot
Disallow: /

BrowserMatch IndeedBot bad_agent

The above is the exact equivalent of putting up a No Admittance sign ... and deadbolting the door.
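
For anyone who wants to test both halves before deploying, here is a small sketch using Python's standard `urllib.robotparser` and `re`. It only approximates the server behaviour: `BrowserMatch` applies a regular expression to the User-Agent header and sets an environment variable, and the access rule that acts on `bad_agent` is not shown in the post, so it is not modelled here.

```python
import re
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the post above.
ROBOTS_TXT = """\
User-Agent: IndeedBot
Disallow: /
"""

# Full User-Agent string as reported in this thread.
FULL_UA = ("Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 "
           "Firefox/38.0 (IndeedBot 1.1)")

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The "No Admittance sign": a compliant bot that identifies itself by the
# token "IndeedBot" is told to stay out of everything.
sign_allows = parser.can_fetch("IndeedBot", "/any/page.html")

# The "deadbolt": approximate the BrowserMatch regex test with re.search.
flagged = re.search(r"IndeedBot", FULL_UA) is not None

print(sign_allows, flagged)
```

Note that `can_fetch` is given the bare token rather than the full header; Python's parser compares against the part of the string before the first slash, which for the full header would be "Mozilla".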

[edited by: keyplyr at 1:58 am (utc) on Jul 23, 2018]
[edit reason] merger clean-up [/edit]

keyplyr

8:10 pm on Jul 22, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You might consider just blocking that Server Farm range...

Host: uk2group.com
107.182.224.0 - 107.182.239.255
107.182.224.0/20
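
If you want to double-check a range-to-CIDR conversion like the one above before adding it to a block list, the standard `ipaddress` module can do it. This is only a sketch; the helper name `range_to_cidrs` is made up for the example.

```python
import ipaddress
from typing import List


def range_to_cidrs(first: str, last: str) -> List[str]:
    """Collapse an inclusive first-last address range into CIDR blocks."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]


# Range copied from the post above.
print(range_to_cidrs("107.182.224.0", "107.182.239.255"))
```

A range that does not align to a single block comes back as several CIDRs, which is a quick way to spot a mistyped boundary.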

TorontoBoy

8:14 pm on Jul 22, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



A 1000-request scrape is a really thorough and deep scrape. If servers had feeling that would hurt.

lucy24

9:09 pm on Jul 22, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If servers had feeling that would hurt.
Yes, I was just saying the other day somewhere hereabouts that it's been a long, long time since I sustained a really comprehensive hit. Evidently someone Down There was listening.

It's because they came in with fully humanoid headers. In fact I think one of them is in a form that I formerly blocked, but gave it up because it's common with mobiles too. And yeah, Firefox/38 is a bit sketchy, but humans do weird things.

The 1000-odd requests spanned about 8 minutes, which works out to about 2 requests per second--not an inherently outrageous pace.

:: detour to pore over headers ::

Hurrah, I think I've found something blockable.

Server Farm range
There may be a bit of subletting going on. I found a stray possibly-human image request. Then again, it could have been a robot testing the waters. (If so, it never came back.)


- - -

[edited by: keyplyr at 12:08 am (utc) on Jul 23, 2018]
[edit reason] fix typo requested by poster [/edit]

not2easy

10:49 pm on Jul 22, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm wondering what's with IndeedBot myself. I had not seen them for at least a year and this month they've hit two sites, both visits coming from within 198.58.72.0/24, which got them a 403.

keyplyr

12:06 am on Jul 23, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Indeed.com is an American worldwide employment-related search engine for job listings launched in November 2004. The bot most probably collects job listings from the domains it visits.

@not2easy - I block cyrusone.com ranges as well.

lucy24

2:02 am on Jul 23, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Incidentally, I was mistaken on one point above:
attempt to opt-out of piwik logging
It wasn't trying to opt-out--though I would not be surprised if some robots figured out how to do this, since it's a default module. It was merely requesting the iframe src that contains the opt-out information. I guess it's considered as essential as scripts and stylesheets. (But why? I can understand scripts, but what job-listing-related information could possibly be hidden away in a stylesheet?)

I took a closer look. It was a very systematic top-to-bottom spidering: first all pages that are directly linked from the front page. (It was so much easier when I didn't have any, so robots claiming the root as referer could be comprehensively blocked.) Then all pages directly linked from those. And then the third layer--this is where I could see it requesting successive chapters of books, for some reason always going from last to first.
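
The layer-by-layer pattern described here is what a breadth-first crawl produces. The sketch below is only an illustration of that ordering, not the bot's actual code: the site map is hypothetical, and it does not reproduce the last-to-first chapter order.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


def crawl_layers(links: Dict[str, List[str]], start: str,
                 max_depth: int = 3) -> Iterator[Tuple[int, str]]:
    """Yield (depth, page) pairs in breadth-first order from start.

    links maps each page to the pages it links to.
    """
    seen = {start}
    queue = deque([(0, start)])
    while queue:
        depth, page = queue.popleft()
        yield depth, page
        if depth == max_depth:
            continue  # do not follow links beyond the deepest layer
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((depth + 1, target))


# Hypothetical site: front page, directories, then book chapters.
SITE = {
    "/": ["/books/", "/about.html"],
    "/books/": ["/books/one/"],
    "/books/one/": ["/books/one/ch1.html", "/books/one/ch2.html"],
}

for depth, page in crawl_layers(SITE, "/"):
    print(depth, page)
```

Every page at depth 1 is visited before any page at depth 2, and the chapters only show up once the third layer is reached.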


Edit: Gosh, keyplyr, that was unnerving. You relocated the thread while I was in the act of posting: “Huh? Where’d it go?!”

keyplyr

6:44 pm on Oct 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also coming from...
Host: uk2group.com
173.244.192.0 - 173.244.223.255
173.244.192.0/19

JamesSC

10:40 pm on Oct 13, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



In these bot listings the line Robots.txt: No/Yes means the bot does not/does honor robots.txt directives, or does it mean something else?

keyplyr

10:44 pm on Oct 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi JamesSC, good question. I am identifying whether the bot requested robots.txt. I am not accounting for whether the bot obeyed the directives.

lucy24

10:57 pm on Oct 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



means the bot does not/does honor robots.txt directives, or does it mean something else?
As you can see from the present thread, there's a wide range of possible behaviors. Because keyplyr and I have different sites and different ways of processing logs, his descriptions will generally only be able to say whether the request was made. Mine can sometimes offer more detail (“Yes but ignores” or “Yes but with a different UA” or “Yes-well-sort-of-barely” as in the present robot) in those comparatively rare cases where the request was, in fact, made.

A further complication is that a great many robots' first visit consists of nothing but a request for the front page--which is not likely to be comprehensively roboted-out, so there's really no way to know if the robot is compliant. Some large sites have dynamic robots.txt files that generate a “this means you” even on brand-new requests, using the current UA, but most sites can only disallow known quantities.
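
As a rough idea of what a dynamic robots.txt might look like, here is a sketch that picks the response body from the requesting User-Agent. Everything in it is an assumption for illustration: the allowlist, the `/private/` path, and the function name are made up, and a real site would wire this into whatever serves `/robots.txt`.

```python
# Hypothetical allowlist of agents that get the normal rules.
KNOWN_GOOD_AGENTS = ("Googlebot", "bingbot")

NORMAL_RULES = "User-agent: *\nDisallow: /private/\n"
THIS_MEANS_YOU = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for this User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in KNOWN_GOOD_AGENTS):
        return NORMAL_RULES
    # Unknown requester: the file it receives disallows everything.
    return THIS_MEANS_YOU


print(robots_txt_for("Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 "
                     "Firefox/38.0 (IndeedBot 1.1)"))
```

Matching on a substring of the header is easy to spoof, so this only changes what a previously unseen robot is told, not whether it listens.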

JamesSC

1:02 am on Oct 14, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Ah, so. Thanks.

keyplyr

10:36 am on Oct 14, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just to be clear... you will never hear me say “Yes-well-sort-of-barely” :)