IndeedBot

TorontoBoy

12:30 pm on Oct 23, 2017 (gmt 0)

UA: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 (IndeedBot 1.1)
Protocol: HTTP/1.1
Robots.txt: No
IP: 198.58.75.* 199.119.215.*
Host: 198.58.72.0 - 198.58.79.255 199.119.212.0 - 199.119.215.255
CIDR: 198.58.72.0/21 199.119.212.0/22
Organization: CyrusOne LLC
Notes: general scraper bot
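
The CIDR line can be derived from the Host ranges. A minimal sketch using Python's standard `ipaddress` module; the variable names are illustrative and not part of the original listing.

```python
import ipaddress

# Start/end addresses copied from the Host line above.
host_ranges = [
    ("198.58.72.0", "198.58.79.255"),
    ("199.119.212.0", "199.119.215.255"),
]

cidrs = []
for first, last in host_ranges:
    # summarize_address_range yields the smallest set of networks
    # that exactly covers the inclusive range first..last.
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    cidrs.extend(str(network) for network in networks)

print(cidrs)  # ['198.58.72.0/21', '199.119.212.0/22']
```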

jonasjacek

3:11 pm on Oct 24, 2017 (gmt 0)

For both CIDRs I get this notice...

$ whois 198.58.72.0/21

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/public/whoisinaccuracy/index.xhtml
#
No match found for n + 198.58.72.0/21.

There seems to be an inaccuracy in their records.

MitchNginx

6:54 am on Apr 13, 2018 (gmt 0)

I can confirm this from my logs:
- No robots.txt requested
- Bot digging all links in a page including hidden honeypots

lucy24

7:58 pm on Jul 22, 2018 (gmt 0)

IP: 107.182.234.abc
UA: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 (IndeedBot 1.1)
Requests: html with all associated css and js (including piwik.js, requested over again** on each page), giving page as referer for css/js
robots.txt: yes and no*

* "yes": yes, it requested robots.txt
"no": but only as request no. 891 of 1002 (plus four on neighboring site associated with attempt to opt-out of piwik logging), long after requesting files in roboted-out directory

** robots aren't allowed to get piwik files, so I don't know whether it would still have requested the file over and over again if it had not kept getting blocked. I suspect no, because non-blocked scripts and stylesheets were only requested once, no matter how many pages use them.

Is it now blocked? Yes, indeed.

User-Agent: IndeedBot
Disallow: /

BrowserMatch IndeedBot bad_agent

The above is the exact equivalent of putting up a No Admittance sign ... and deadbolting the door.
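
For the "sign" half, the effect of those two robots.txt lines on a compliant robot can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the path and the second robot name are hypothetical placeholders, and it says nothing about the server-side `BrowserMatch` block, which is what stops a robot that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted above.
rules = ["User-Agent: IndeedBot", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# "/some/page.html" and "SomeOtherBot" are hypothetical placeholders.
indeedbot_allowed = parser.can_fetch("IndeedBot", "/some/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/some/page.html")

print(indeedbot_allowed)  # False: every path is disallowed for IndeedBot
print(other_allowed)      # True: no rule in these two lines applies to other robots
```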

[edited by: keyplyr at 1:58 am (utc) on Jul 23, 2018]
[edit reason] merger clean-up [/edit]

keyplyr

8:10 pm on Jul 22, 2018 (gmt 0)

You might consider just blocking that Server Farm range...

Host: uk2group.com
107.182.224.0 - 107.182.239.255
107.182.224.0/20
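
A quick way to confirm that the address reported above falls inside this range, sketched with Python's `ipaddress` module. The last octet of the reported address was withheld, so the check uses the enclosing /24 rather than guessing a full address; the variable names are illustrative.

```python
import ipaddress

farm_range = ipaddress.ip_network("107.182.224.0/20")

# The /24 implied by the partially withheld address 107.182.234.abc.
seen_subnet = ipaddress.ip_network("107.182.234.0/24")

# First and last addresses of the /20, matching the range listed above.
print(farm_range[0], farm_range[-1])  # 107.182.224.0 107.182.239.255

# subnet_of requires Python 3.7 or later.
print(seen_subnet.subnet_of(farm_range))  # True
```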

TorontoBoy

8:14 pm on Jul 22, 2018 (gmt 0)

A 1000-request scrape is a really thorough and deep scrape. If servers had feeling that would hurt.

lucy24

9:09 pm on Jul 22, 2018 (gmt 0)

If servers had feeling that would hurt.

Yes, I was just saying the other day somewhere hereabouts that it's been a long, long time since I sustained a really comprehensive hit. Evidently someone Down There was listening.

It's because they came in with fully humanoid headers. In fact I think one of them is in a form that I formerly blocked, but gave it up because it's common with mobiles too. And yeah, Firefox/38 is a bit sketchy, but humans do weird things.

The 1000-odd requests spanned about 8 minutes, which works out to about 2 requests per second--not an inherently outrageous pace.
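
As a quick check of that rate, here is the arithmetic using the 1002 requests reported earlier in the thread and treating "about 8 minutes" as exactly 480 seconds (an assumption):

```python
request_count = 1002     # request count reported earlier in the thread
duration_s = 8 * 60      # "about 8 minutes", assumed to be exactly 480 s

rate = request_count / duration_s
print(f"roughly {rate:.1f} requests per second")
```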

:: detour to pore over headers ::

Hurrah, I think I've found something blockable.

Server Farm range

There may be a bit of subletting going on. I found a stray possibly-human image request. Then again, it could have been a robot testing the waters. (If so, it never came back.)

- - -

[edited by: keyplyr at 12:08 am (utc) on Jul 23, 2018]
[edit reason] fix typo requested by poster [/edit]

not2easy

10:49 pm on Jul 22, 2018 (gmt 0)

I'm wondering what's with IndeedBot myself. I had not seen them for at least a year and this month they've hit on two sites, both visits from within 198.58.72.0/24 which gave them a 403.

keyplyr

12:06 am on Jul 23, 2018 (gmt 0)

Indeed.com is an American worldwide employment-related search engine for job listings launched in November 2004. The bot most probably collects job listings from the domains it visits.

@not2easy - I block cyrusone.com ranges as well.

lucy24

2:02 am on Jul 23, 2018 (gmt 0)

Incidentally, I was mistaken on one point above:

attempt to opt-out of piwik logging

It wasn't trying to opt out--though I would not be surprised if some robots figured out how to do this, since it's a default module. It was merely requesting the iframe src that contains the opt-out information. I guess it's considered as essential as scripts and stylesheets. (But why? I can understand scripts, but what job-listing-related information could possibly be hidden away in a stylesheet?)

I took a closer look. It was a very systematic top-to-bottom spidering: first all pages that are directly linked from the front page. (It was so much easier when I didn't have any, so robots claiming the root as referer could be comprehensively blocked.) Then all pages directly linked from those. And then the third layer--this is where I could see it requesting successive chapters of books, for some reason always going from last to first.
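
That layer-by-layer pattern is what a breadth-first crawl looks like. A minimal sketch over a made-up link graph follows; the page paths and the `crawl_layers` helper are hypothetical, and nothing here is known about how the robot itself is implemented. It also does not model the last-to-first ordering within a layer.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/books/", "/about"],
    "/books/": ["/books/a/", "/books/b/"],
    "/about": [],
    "/books/a/": ["/books/a/ch1", "/books/a/ch2"],
    "/books/b/": [],
    "/books/a/ch1": [],
    "/books/a/ch2": [],
}

def crawl_layers(start, max_depth):
    """Return pages grouped by link distance from start (breadth-first)."""
    seen = {start}
    layers = []
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in seen:  # visit each page only once
                    seen.add(target)
                    next_frontier.append(target)
        if not next_frontier:
            break
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

layers = crawl_layers("/", 3)
for depth, pages in enumerate(layers, start=1):
    print(depth, pages)
```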

Edit: Gosh, keyplyr, that was unnerving. You relocated the thread while I was in the act of posting: "Huh? Where'd it go?!"

keyplyr

6:44 pm on Oct 13, 2018 (gmt 0)

Also coming from...
Host: uk2group.com
173.244.192.0 - 173.244.223.255
173.244.192.0/19

JamesSC

10:40 pm on Oct 13, 2018 (gmt 0)

In these bot listings the line Robots.txt: No/Yes means the bot does not/does honor robots.txt directives, or does it mean something else?

keyplyr

10:44 pm on Oct 13, 2018 (gmt 0)

Hi JamesSC, good question. I am identifying whether the bot requested robots.txt. I am not tracking whether the bot obeyed the directives.
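
To illustrate the distinction, here is a rough sketch of a check for whether, and how late, a visitor asked for robots.txt. It assumes the request paths for a single visitor have already been extracted from the logs in order; the function name and the sample paths are hypothetical.

```python
def robots_txt_position(paths):
    """Return the 1-based position of the first robots.txt request, or None."""
    for position, path in enumerate(paths, start=1):
        if path == "/robots.txt":
            return position
    return None

# Hypothetical request sequences, in log order, for a single visitor.
never_asked = ["/", "/page1.html", "/style.css"]
asked_late = ["/", "/page1.html", "/robots.txt", "/page2.html"]

print(robots_txt_position(never_asked))  # None: robots.txt was never requested
print(robots_txt_position(asked_late))   # 3: requested, but only after other files
```

A result of None corresponds to "Robots.txt: No" in the listings. A late position only shows that the file was requested, not that its directives were obeyed.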

lucy24

10:57 pm on Oct 13, 2018 (gmt 0)

means the bot does not/does honor robots.txt directives, or does it mean something else?

As you can see from the present thread, there's a wide range of possible behaviors. Because keyplyr and I have different sites and different ways of processing logs, his descriptions will generally only be able to say whether the request was made. Mine can sometimes offer more detail ("Yes but ignores" or "Yes but with a different UA" or "Yes-well-sort-of-barely" as in the present robot) in those comparatively rare cases where the request was, in fact, made.

A further complication is that a great many robots' first visit consists of nothing but a request for the front page--which is not likely to be comprehensively roboted-out, so there's really no way to know if the robot is compliant. Some large sites have dynamic robots.txt files that generate a "this means you" even on brand-new requests, using the current UA, but most sites can only disallow known quantities.

JamesSC

1:02 am on Oct 14, 2018 (gmt 0)

Ah, so. Thanks.

keyplyr

10:36 am on Oct 14, 2018 (gmt 0)

Just to be clear... you will never hear me say "Yes-well-sort-of-barely" :)
