Experibot v1

keyplyr

11:47 am on Jan 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Experibot_v1
Protocol: HTTP/1.1
Robots.txt: Yes, but only after requesting 2 other pages, and more than 2 hours later.
Host: BEZEQINT-BROADBAND (bezeqint.net)
79.182.0.0 - 79.182.255.255
79.182.0.0/16

My site gets measurable traffic from this Israeli ISP, with a few pests mixed in: broadband users pulling down my entire site, presumably to save on bandwidth fees.

keyplyr

7:45 am on Jan 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Same host, different range...

Host: BEZEQINT-BROADBAND (bezeqint.net)
79.176.0.0 - 79.176.255.255
79.176.0.0/16

Pretends to come from a Google search. Still ignoring robots.txt.

79.176.184.19 - - [26/Jan/2016:17:24:06 -0800] "GET / HTTP/1.1" 301 494 "http://www.google.com" "Experibot_v1"
79.176.184.19 - - [26/Jan/2016:17:24:07 -0800] "GET / HTTP/1.1" 403 983 "http://www.google.com" "Experibot_v1"

lucy24

9:39 pm on Jan 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Huh. A few years ago, I was much vexed with assorted malign robots coming from Bezeq. I don't think I even realized they were a human ISP.

Still ignoring robots.txt.

Someone, possibly you, once explained why it's to a robot's advantage to ask for robots.txt before proceeding to ignore it, but I can't remember the reason. I don't remember meeting anyone who proceeded directly to requesting things they might otherwise not have known about, like /piwik/ -- and that's the only reason I can think of for asking.

keyplyr

10:55 pm on Jan 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Requesting robots.txt may get the agent through some filters with some IT configs. It also may give the impression to the novice that the agent is in fact supporting web standards.

However, a smartly written bot just needs to connect to the file server to crawl through the hierarchy. There is no need for a list. Any file that resides on the server can be accessed unless it is blocked by some method. If it is on the server, then it is published.

lucy24

11:55 pm on Jan 27, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



However a smartly written bot just needs to connect to the file server to crawl through the hierarchy.

Yikes. How does it do this? Can't a host / server administrator of approximately similar intelligence prevent it?

keyplyr

12:07 am on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can't a host / server administrator of approximately similar intelligence prevent it?

Yes, but then any files under "/" would be blocked for everybody.

So what ya want is a selective block/allow... and that's what we're doing :)

lucy24

7:22 am on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I meant, how does the bot get into the top level of the server? Wouldn't you have to request a numerical IP address, which would get you forcibly redirected to ... well, somewhere? And why would the directory structure be visible? Isn't that what -Indexes is for?

keyplyr

8:04 am on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Of course all bots (and everything else) need to discover the domain (IP address) by some means, so yes, a request occurs.

In most cases a directory structure isn't visible; it doesn't need to be. It really depends on how the bot software is set up, but, for example, if a bot's script is written to get all files after root (/), then that's what happens, barring blocks or, as you mentioned, redirects.

-Indexes (in htaccess) is for human looky-loos and does not affect direct requests. I'm pretty sure you've had a bot scrape all your image files despite -Indexes being present in the folder's htaccess, haven't you? And haven't you also wondered how certain files can be requested despite the fact that no link to them is anywhere on your site?

Most bots we see as webmasters are crawlers, going from link to link in a linear fashion. Other bots (usually vertical) are built to connect to the file server and get all your stuff. This is where it's fun to plant some surprises :)

Disclaimer - even though this thread is titled "Experibot", there is no evidence this agent exhibits the behavior discussed after the first two posts.

dstiles

7:32 pm on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> connect to the file server and get all your stuff

I'd be interested to know how anything can do that without knowing the file names. How can a bot even get access to the file server to begin with? That would mean bypassing the web server, surely, which would denote a serious server setup error.

The only way I can think of this working is if the web server offered a directory listing when a test access failed to find a file, and that is a rarity nowadays, generally reserved for tech repositories.

keyplyr

7:39 pm on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I didn't say "bypassing the web server." Without going into detail, a bot just doesn't need a list to get a file.

lucy24

11:01 pm on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And haven't you also wondered how certain files can be requested despite the fact that no link to them is anywhere on your site?

Matter of fact, I've noticed that nobody ever requests deeper material from my test site. They'll take guesses like /admin/ and its relatives, but all the actual directories have goofy names and, nope, nobody has ever requested them.

Options -Indexes is inherited, so you only need to say it once. Unfortunately, hosts tend to have +Indexes by default, so you do have to change it. In fact, it's probably one of the first things people put in their first htaccess file when they first venture to make one. And then things balloon from there ;)

a bot just doesn't need a list to get a file

But it needs some way of knowing that the file exists. They don't just make up names. Unless they're the Googlebot looking for soft 404s.

keyplyr

11:51 pm on Jan 28, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But it needs some way of knowing that the file exists

It can get all that by requesting the files under each level (folder). It doesn't need the file name *prior* to the request. It tells the server to open each folder and get each type of file (or all files), and in doing so it learns the file names.
Example:
Command=GetFolders&Type=File&CurrentFolder=...

Why don't you go to GitHub or another software repository and test-drive a few bots? There are many types of bot software, each written differently and each using its own method; I'm not saying all bots do things the same way.
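
For illustration, one mundane way a bot can learn file names it was never linked to is an auto-generated directory index (e.g. Apache mod_autoindex). A minimal sketch, assuming the target host actually serves such listings; the URL is hypothetical:

    # Sketch: harvest file names from an auto-generated directory index.
    # Assumes the server actually returns an HTML listing for the folder;
    # with "Options -Indexes" set, the same request gets a 403 instead.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class IndexLinks(HTMLParser):
        """Collect href values from the anchor tags of a listing page."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.hrefs += [v for k, v in attrs if k == "href" and v]

    def list_folder(url):
        """Return absolute URLs of the entries shown in a directory index."""
        with urlopen(url, timeout=10) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        parser = IndexLinks()
        parser.feed(page)
        # Drop sort-order links ("?C=M;O=A") and the parent-directory entry.
        return [urljoin(url, h) for h in parser.hrefs
                if not h.startswith("?") and h not in ("../", "/")]

    # Hypothetical usage:
    # for entry in list_folder("http://www.example.com/images/"):
    #     print(entry)

This only illustrates one discovery technique; it is not a claim about what Experibot itself does.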

Oblivious

9:34 am on Jan 30, 2016 (gmt 0)

10+ Year Member



Hello everyone!

I am the creator of the Experibot_v1 crawler.

First, let me express my apologies for any misbehavior on my bot's part (I feel like a parent).
It was certainly not my intent to create a malicious bot. In my latest patch I forgot to include, in the bot's UA string, the link to the explanation page (a little outdated): [dl.dropboxusercontent.com...]

Secondly, I've worked very hard to adhere as best as I could to the robots.txt standards, as far as allow/disallow directives are concerned. I've designed the bot to always fetch the robots.txt file before any page other than the absolute root address, and then to conform to its directives from that point on.

Thirdly, as far as the crawl-delay is concerned, I am ordering my links in "batches" of 10K links (it takes my program about 8 seconds to crawl each 1000), such that each link in the 10K list comes from a different IP address, making it improbable that I would "hit" the same server over and over again in a short period of time (at least by design).

I apologize again and would like to improve my crawler so it doesn't do any damage to anyone.
If you have the will and time, please let me know, either here or via email (amirkr@gmail.com) what happened. Did I repeatedly crawl links from your sites in a short period of time? How was the robots.txt file not respected? (what links did you disallow but I downloaded anyway?) etc.
In any event, you can also send me your site's root address (like "www.mysite.com" or something) and I can put it in the blacklist so no crawl will be done there whatsoever.

Thanks, and sorry again,
Amir.

P.S. - The structure of link exploration is simple: I just follow links from sites. No hierarchy is walked in some DFS fashion.

keyplyr

10:02 am on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you for real?

User-agent: Experibot
Disallow:

That means no. It does not mean to keep hitting pages and pretending to come from a Google search. So now you're blocked across our sites. Well done.

Oblivious

10:08 am on Jan 30, 2016 (gmt 0)

10+ Year Member



Hey, there is no need for insults.
First, the bot's name is Experibot_v1 (with the v1!). It does not detect just "Experibot", so in the regex which parses the robots.txt file, I look for either
User-agent: * OR User-agent: Experibot_v1

Second, an empty disallow directive, such as
Disallow:
means I can crawl everything, doesn't it?

Disallowing everything is
Disallow: /
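
That reading matches the robots.txt convention: an empty Disallow value allows everything, while "Disallow: /" blocks the whole site. A quick illustration with Python's standard-library parser (a sketch, not Experibot's actual code):

    from urllib.robotparser import RobotFileParser

    def allowed(robots_txt, agent, url):
        """Parse a robots.txt string and check one URL for one agent."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        return parser.can_fetch(agent, url)

    # An empty Disallow value means everything is allowed for that agent.
    open_rules = "User-agent: Experibot_v1\nDisallow:\n"
    print(allowed(open_rules, "Experibot_v1", "/any/page.html"))    # True

    # "Disallow: /" blocks the whole site for that agent.
    closed_rules = "User-agent: Experibot_v1\nDisallow: /\n"
    print(allowed(closed_rules, "Experibot_v1", "/any/page.html"))  # False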

I've worked a lot to try and do everything "by the book". My bot does not just ignore robots.txt. It took about three net months of added work to configure the architecture to adhere to these (justified) standards.
If you choose to block my bot that is - of course - your every right.
I do apologize again if my bot caused any problems with your site, or any other site and will re-configure it to detect also the name "experibot" without the "v1".

Amir.

[edited - removed link]

[edited by: Oblivious at 10:34 am (utc) on Jan 30, 2016]

keyplyr

10:16 am on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



* Please remove the link you posted to avoid our forum TOS violation *

The insult is you coming here saying your bot supports web standards when it clearly does not.

BTW - the example I gave was not a cut'n paste, just a quick response from my mobile phone. The actual robots.txt has correct syntax, and has been ignored on 2 of our sites by your bot.

So yes, it's now blocked:

79.176.184.19 - - [29/Jan/2016:09:12:03 -0800] "GET / HTTP/1.1" 301 494 "http://www.google.com" "Experibot_v1"
79.176.184.19 - - [29/Jan/2016:09:12:03 -0800] "GET / HTTP/1.1" 403 983 "http://www.google.com" "Experibot_v1"

Oblivious

10:21 am on Jan 30, 2016 (gmt 0)

10+ Year Member



"I've worked very hard to adhere as best as I could to the robots.txt standards"

I think I specifically wrote that I'm trying, not that I've reached perfect adherence (though I make every effort to). I tested the robots.txt code on a large number of sample sites in order to verify I'm not crawling any disallowed pages, but due to the enormous number of possibilities (including programming errors on my part - I'm human after all) I may have made mistakes. There was no insult intended and I do apologize for any problems the bot caused.

As I said, blocking is your every right and I respect that. Can I kindly ask you for your site's address so I can see the robots.txt file and figure out where the problem was, so it doesn't happen in the future?

EDIT - I found one of the sites in your profile, whose robots.txt was misread by the bot.
The syntax is indeed exactly as it should be. I'm shutting the crawler down and will test this immediately when I return home. Thanks for pointing out a problem.

Amir.

Oblivious

10:39 am on Jan 30, 2016 (gmt 0)

10+ Year Member



Again, thanks. It's important for me to detect these things early on.
I suspect some regex error caused the code to misidentify the relevant part of the robots.txt file.

keyplyr

10:50 am on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you read through the many posts in the "Search Engine Spider and User Agent Identification" forum here at WebmasterWorld you will get a pretty good idea what web site owners expect from bot runners. After all, it is our property you are after.

Here are a few of the main points:
1. Request robots.txt first before any other files and then support it.
2. Include a link to an info page in the UA string. The info page should reside at the same web address as the company/individual responsible for the bot. The info page should identify who you are, what you are taking from our web properties and what you intend to do with it. This should be specific.
3. Respect our copyright.

This is only fair. Most web site owners understand how the web works with its many interests, but since it is our property you are asking for, I think we deserve this consideration and many of us will block your bot without it.

All this uproar is because the first time your bot hit one of my sites, it grabbed a half dozen files prior to requesting robots.txt. At that time the User Agent was not known to me so it was not disallowed. However because of what I consider "bad behavior" I put up a temporary block & a disallow in robots.txt.

The next 2 times it visited, it did not even request robots.txt AFAIK. It went straight for web pages where it was blocked. It is now in the block list until further consideration.

Oblivious

11:01 am on Jan 30, 2016 (gmt 0)

10+ Year Member



As I said, you're perfectly right. I do understand this is your property and you are entitled to allow / disallow anyone from accessing it at will. I fully respect that.

In the previous version, the link to my info page for the crawler was supplied in the bot's UA string (it was "Experibot_v1 [the address]"). But for some reason I forgot to add it when I re-wrote the code. This will be easily fixed. I take great care with this, as I've repeatedly searched the web for mentions of the crawler even after I debugged it to the best of my ability, without anyone approaching me with problems (yeah, I forgot the explanation page, so that might have been problematic... sorry).

The second two times, the bot went straight to pages because I had already stored your robots.txt file on disk (saved for later "runs" of the crawler), and it was not yet time (several days) to re-request it (since it might have changed). Meaning: I ran the crawler, it FIRST grabbed the robots.txt file, THEN crawled any other pages, then it stopped for a few days. I re-ran it a few days later with the robots.txt file already stored in my system, so I just asked for the "allowed" pages. Is this considered really bad behavior? (Assuming the correct robots.txt IS stored and IS respected.)

Thanks again for your reply, everything you wrote is being taken care of.

keyplyr

11:14 am on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I also operate a bot (you can find info at my site under "resources"). I have found that a 24-hour cache for robots.txt is best, since some webmasters may revise it daily. Otherwise, if a revised file is not followed, conversations like this may result :)
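
A per-host cache along those lines is straightforward; a minimal sketch, with illustrative names rather than anyone's real implementation:

    import time
    from urllib.robotparser import RobotFileParser

    ROBOTS_TTL = 24 * 60 * 60      # re-fetch robots.txt after 24 hours
    _robots_cache = {}             # host -> (fetched_at, parser)

    def robots_for(host):
        """Return a parsed robots.txt for host, re-fetching at most once a day."""
        now = time.time()
        cached = _robots_cache.get(host)
        if cached and now - cached[0] < ROBOTS_TTL:
            return cached[1]
        parser = RobotFileParser(f"http://{host}/robots.txt")
        parser.read()              # fetch and parse the live file
        _robots_cache[host] = (now, parser)
        return parser

    # Hypothetical usage in a fetch loop:
    # if robots_for("www.example.com").can_fetch("Experibot_v1", "/some/page"):
    #     ...download the page...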

Oblivious

11:30 am on Jan 30, 2016 (gmt 0)

10+ Year Member



I understand. I will make the necessary adjustments.
Thanks again for the input!

Oblivious

7:46 pm on Jan 30, 2016 (gmt 0)

10+ Year Member



Wow. It's a very good thing I've found this thread and that you pointed out the problem.
It turns out I truncated the Disallow directive at the wrong index, such that "Disallow: /" became the string ":" instead of "/" (I searched for ":" in the URL instead of "/"). This happens not only with your site but with every other site, so the entire robots.txt handling was malformed due to a single index shift. It's the result of a mishap in the code that happened after I had already tested it. Bah.
Thanks.
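
Reduced to code, the kind of mistake described looks roughly like taking the wrong slice of the Disallow line (a reconstruction of the described bug, not the actual source):

    line = "Disallow: /"

    # Buggy: slice from the position of ":" and grab the first token,
    # so the stored path ends up being ":" and never matches any URL.
    buggy_path = line[line.index(":"):].split()[0]    # -> ":"

    # Fixed: take everything after the first ":" and strip the whitespace.
    fixed_path = line.split(":", 1)[1].strip()        # -> "/"

    print(repr(buggy_path), repr(fixed_path))         # ':' '/'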

lucy24

10:32 pm on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



it takes my program about 8 seconds to crawl each 1000

:: counting on fingers ::
I make that 125 requests per second, which is approximately 124 more than I like to see from a robot.

It does not detect just "Experibot", so in the regex which parses the robots.txt file, I look for either
User-agent: * OR User-agent: Experibot_v1

Er, aren't robots supposed to interpret User-Agent directives as broadly as possible? Change the ### RegEx. If you really believe there are unrelated robots named "Experibotblahblah" or "Experibot123" whose rules you don't want to follow, a simple (\b|_) will do it. You can't expect people to tweak their robots.txt file just because some minor robot has changed version numbers.
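
In regex terms that suggestion amounts to something like this (a sketch; the exact pattern is up to the bot's author):

    import re

    # Match robots.txt records for "Experibot" with or without a version
    # suffix, but not unrelated names such as "ExperibotsRUs".
    UA_RECORD = re.compile(r"^user-agent:\s*experibot(\b|_)", re.IGNORECASE)

    for record in ("User-agent: Experibot",
                   "User-agent: Experibot_v1",
                   "User-agent: ExperibotsRUs"):
        print(record, "->", bool(UA_RECORD.match(record)))
    # Experibot      -> True
    # Experibot_v1   -> True
    # ExperibotsRUs  -> False
    # (The wildcard record "User-agent: *" is handled separately.)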

Oblivious

11:47 pm on Jan 30, 2016 (gmt 0)

10+ Year Member



No, you've misunderstood me.
125 per second is correct, but what I meant was that every 10K consecutive links I download are always from *different* IP addresses. This means (due to other considerations I've made) that the approximate distance between two connections to the same site is 10K links. Since I download LESS than 125 per second, I can guarantee at least 8 seconds per 1,000 links, so it's 80 seconds between any two calls to the same site, almost for sure.

Second, the name has been the same since March 2014. I don't plan on changing it soon, and I already said I'll revise the robots.txt parsing to also recognize "Experibot" without the "v1".

Third, the problem I reported earlier (just sharing) affected only a small subset of sites. But I'm working on making the bot as polite as possible, most importantly by restoring the description URL to the UA string.

P.S. - there is no need to get angry (if I got the "###" reference right). I'm a single person trying to write a crawler - which is a daunting task - and for months I've been tweaking it and working on it for hours, doing the best I can to be polite. This is an extremely complicated task and I'm only human. Any mishap is not due to negligence or maliciousness but to simple oversight, which I'm working to fix. I'm sorry for any inconvenience to any site owner, but I've made sure I make very few connections to any one site (the 10K-links mechanism), so that owners can report a problem and I can block their site myself, causing minimal damage. I'm not unleashing some catastrophic crawler to mount denial-of-service attacks.

lucy24

12:13 am on Jan 31, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



it's 80 seconds between any two calls to the same site
Yup, that seems more than reasonable. Still, bonus points if the robot understands and obeys
Crawl-Delay: 120
(Mine doesn't actually say this; it's "Crawl-Delay: 3".) If you go by the official robots.txt standard, the only sine qua non is the "Disallow:" directive. But some things are so ubiquitous there's really no excuse. Nope, not even if Google Itself ignores it ;)

Oblivious

12:20 am on Jan 31, 2016 (gmt 0)

10+ Year Member



Yes, of course, you're right.
I wanted to do that (instead of the more complicated scheme I have going), but the bot's structure does not allow it (without a tremendous change I'm not sure how to make), since I have no way to keep track of when I last crawled each domain quickly enough not to clog the downloading pipeline. So I had to make do with this solution and hope no one will complain about crawl delays of over a minute :)
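
For what it's worth, per-host politeness can be as simple as a dictionary of "earliest time this host may be hit again"; a sketch under that assumption (not Experibot's actual pipeline; fetch() below is hypothetical):

    import time
    from urllib.parse import urlparse

    DEFAULT_DELAY = 80     # seconds between two requests to the same host
    _next_allowed = {}     # host -> earliest timestamp of the next request

    def wait_turn(url, delay=DEFAULT_DELAY):
        """Block until the URL's host may be requested again, then reserve the slot."""
        host = urlparse(url).netloc
        now = time.time()
        ready_at = _next_allowed.get(host, 0.0)
        if now < ready_at:
            time.sleep(ready_at - now)
            now = ready_at
        _next_allowed[host] = now + delay

    # In the download loop (fetch() is a hypothetical page fetcher):
    # wait_turn(url)
    # page = fetch(url)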

Oblivious

8:52 pm on Jan 31, 2016 (gmt 0)

10+ Year Member



To the thread initiator -
I have fixed several issues. A few connections were made to your site to download the robots.txt file using the old UA string. Any further connections should include the updated string with the info page, re-written according to your suggested guidelines.
Regardless, I have made sure your robots.txt file is correctly parsed and added your site to the no-crawl list, so no pages (other than the robots.txt file) are expected to be fetched.

keyplyr

5:42 am on Feb 1, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know what "added your site to the no-crawl list" accomplishes. I have more than one site and I manage a half dozen more and I already had this agent blocked.

What I'd be more interested in would be for you to (again) tell us who you are & what you are taking from our web properties and what you intend to do with it. Then we decide whether to let you take it. No "no-crawl list" needed :)

Oblivious

5:08 pm on Feb 1, 2016 (gmt 0)

10+ Year Member



1. I could only remove the site I know, passion4jazz :)

2. My user agent string now contains a link to an explanatory page (I don't know if it's okay to post the link here, even though it's my page). In short: I'm Amir, a private researcher / entrepreneur operating from my home in Israel (Bezeq is an Israeli telephone service provider). I'm conducting a web search experiment, downloading pages of all kinds (no specific content is targeted, as long as it's in English) and indexing them to try to build a small-scale search engine on my machine. Any pages I download with the Experibot_v1 crawler are not published and are only used to demonstrate an indexing mechanism.

By the way, I was able to "take" the robots.txt file of the aforementioned site with my current agent (Experibot_v1). Are you sure it's blocked? I don't want to cause any problems for your sites.

Amir.