Forum Moderators: open

IRLbot/2.0

dumb edu-crawler is up again


thetrasher

2:40 pm on Nov 2, 2005 (gmt 0)

10+ Year Member



IRLbot/2.0 arrives!
UA-String: "IRLbot/2.0 (+http://irl.cs.tamu.edu/crawler)"

IP: 128.194.135.xx (2005-10-21, 2005-10-28)
IP: 35.9.45.xx (2005-10-28, 2005-10-31)

It read robots.txt only on 2005-10-21, then it fetched only the index page, again and again. At "Texas A&M University" / "Michigan State University" (?!) they don't know about If-Modified-Since...
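For context, If-Modified-Since lets a crawler avoid re-downloading an unchanged page: it sends the time of its last fetch, and the server answers 304 Not Modified (with no body) if nothing has changed. A minimal sketch of the idea in Python — the helper names are made up for illustration:

```python
from email.utils import formatdate

def conditional_headers(last_fetch_timestamp):
    """Headers for a conditional GET. last_fetch_timestamp is the Unix
    time of the previous successful fetch of this URL."""
    return {"If-Modified-Since": formatdate(last_fetch_timestamp, usegmt=True)}

def should_reparse(status_code):
    """304 means the cached copy is still current -- skip re-parsing."""
    return status_code != 304
```

A crawler using this would have downloaded the index page once and received cheap 304 responses on every revisit afterwards.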

GaryK

9:51 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



they don't know about If-Modified-Since...

In my experience these crawlers being used by universities are written by students and supervised by instructors who don't have a clue what effect their sloppy coding has on websites in general. If they did, and also had some sense of ethics or even consideration, we would not constantly have these kinds of problems.

Thanks for the heads-up about this new version.

Lord Majestic

9:53 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



they don't know about If-Modified-Since

Almost no crawler (mass scale that is) supports it - I think even Google only started using it this year.

GaryK

10:03 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your comments are appreciated and correct; perhaps I quoted the wrong section of the parent message. I was referring more to the pathetic way in which these universities develop and test crawlers with absolutely no regard to how the crawler will affect websites. They disregard robots.txt, request the same (often disallowed) page a zillion times in a row, every day, sometimes for days on end. I'd love to help them try and develop the next Google, for example. But I suspect even Google was better behaved when it was being developed than most of the projects coming out of the universities these days.

Lord Majestic

10:19 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was referring more to the pathetic way in which these universities develop and test crawlers with absolutely no regard to how the crawler will affect websites.

Yes, I agree with this - people who write crawlers ignore webmasters at their own peril :)

Dijkgraaf

11:13 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well maybe someone should create a standards document on what things a spider/bot should and shouldn't do.
I did a bit of an exercise like that a few months back, where I created a bot scoring system that would give positive and negative points depending on the behaviour.
If anyone is interested I can sticky them the URL to the web page.
It needs a bit more refinement, and I have some more ideas to improve it.

wilderness

11:46 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well maybe someone should create a standards document on what things a spider/bot should and shouldn't do.

What would it accomplish?

All bots need to do is comply with either UAG or TOS of a website, which they have neither an interest in nor the ability to read.

GaryK

11:51 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



IMO it's really simple.

1. Read robots.txt before each crawl.
2. Do not read files/folders that are disallowed in robots.txt.
3. Do not take more than one page every 2-3 seconds.
4. Make better use of If-Modified-Since.

Even then some of us need to put a limit on the number of bots that crawl our sites. In my case I can only spare the resources for bots that send me traffic. If your bot constantly crawls my site and never sends me any traffic in return it won't be long before the bot is disallowed or forcibly banned.

Am I missing anything?
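Those four rules are mechanical enough to sketch in a few lines using Python's standard library. The robots.txt content, bot name and 2.5-second delay below are illustrative only; a real crawler would fetch robots.txt over HTTP:

```python
import time
import urllib.robotparser

# Illustrative robots.txt -- rule 1 says to (re)read this before each crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

_last_request = 0.0  # monotonic time of the previous fetch

def polite_fetch(url, user_agent="ExampleBot", min_delay=2.5):
    """Fetch url only if allowed, no faster than one page per min_delay s."""
    global _last_request
    if not rp.can_fetch(user_agent, url):   # rule 2: honour Disallow
        return None
    wait = min_delay - (time.monotonic() - _last_request)
    if wait > 0:                            # rule 3: throttle requests
        time.sleep(wait)
    _last_request = time.monotonic()
    # rule 4: a real fetch would add an If-Modified-Since header here
    return url                              # stand-in for the HTTP request
```

Rule 4 (If-Modified-Since) is only marked with a comment here, since it belongs in the HTTP request itself.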

Lord Majestic

11:53 pm on Nov 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All bots need to do is comply with either UAG or TOS of a website

LOL, you can't be serious! Hell, it would take a qualified lawyer some time to understand some ToS, let alone a machine! Bots can't understand human language and won't for a long time; whether you like it or not, the best you can do is use an agreed API like robots.txt to tell bots to stay away from your site.

wilderness

12:51 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You know Majestic, you're a real PITA.

I refrained from adding drivel to a thread of yours when the door was left wide open for bashing.

Why on earth would you take something out of context and then feel the need to supply an answer for the omitted material?

I submitted

All bots need to do is comply with either UAG or TOS of a website, which they have neither an interest in nor the ability to read.

you quoted

All bots need to do is comply with either UAG or TOS of a website

Lord Majestic

12:58 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All bots need to do is comply with either UAG or TOS of a website, which they have neither an interest in nor the ability to read.

How can a bot understand TOS like, for example, the one here: [webmasterworld.com...]

Do you have code, or some good methodology, for implementing a good understanding of human writing in multiple languages?

P.S. If you want to add your opinion to my thread then please do -- I posted it for just one reason: to get feedback.

Dijkgraaf

1:07 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lord Majestic and wilderness, I think you are both saying the same thing: bots cannot understand TOS or UAG. So maybe what is needed is a standard that does cover this, something along the lines of the Privacy Policy.

Lord Majestic

1:12 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd say a few additions to robots.txt are well overdue -- robots.txt is supposed to be requested by all good bots, and it's an ideal place for add-ons like the Crawl-Delay directive introduced by Microsoft.
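For illustration, a robots.txt using the Crawl-Delay extension might look like the fragment below. Python's standard robotparser (3.6+) happens to read the directive back, even though it was never part of the original robots.txt standard; the bot names here are made up:

```python
import urllib.robotparser

# Hypothetical robots.txt with the non-standard Crawl-delay extension.
ROBOTS_TXT = """\
User-agent: SlowBot
Crawl-delay: 10
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.crawl_delay("SlowBot"))   # -> 10 (seconds between requests)
print(rp.crawl_delay("OtherBot"))  # -> None (no delay specified for *)
```

A bot that honours the directive would sleep at least that many seconds between requests to the site.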

wilderness

2:28 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How can a bot understand TOS like, for example, the one here: [webmasterworld.com...]

Do you have code, or some good methodology, for implementing a good understanding of human writing in multiple languages?

majestic,
When we as webmasters report an infraction (a violation of their UAG/TOS, or of our own) to either a bot operator or an internet provider?

We are compelled to jump through hoops while standing on our heads with balls of fire coming out of our backsides!
And for all this, in most instances we are favored with an automated reply informing us that we must "jump through hoops while standing on our heads with balls of fire coming out of our backsides."

Why should a bot, spidering our material on our web pages and absorbing our bandwidth, be expected to comply with any lesser standard?

I do agree with and understand the translation issue between languages.
It's the primary reason that the majority of RIPE users are denied access to my sites: the ability does not exist for me to determine whether my materials have been duplicated.

The translation tools are just not functional.
Additionally, if you've ever been fortunate enough to read something that has been translated from one language into 3-4 others and then back into the original?
The end result is not in any way similar to the original text.

Lord Majestic

3:04 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why should a bot, spidering our material, on our web pages and absorbing our bandwidth be expected to comply to any lesser standard?

I think there is some confusion here: which standard do you refer to? If it's robots.txt then I am in complete agreement that bots should follow it, but if it's something else then I am not sure I understand what you're referring to, as you can't possibly be expecting a machine to understand millions of different Terms and Conditions pages that are often written by lawyers for lawyers, not humans.

wilderness

3:41 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but if it's something else then I am not sure I understand what you're referring to, as you can't possibly be expecting a machine to understand millions of different Terms and Conditions pages that are often written by lawyers for lawyers, not humans.

And why not?

Users are expected to comply with software UAGs when more than 90% don't even read them, and the majority that do read them haven't a clue what they are reading.

Google, Yahoo, MS and ALL the others have either UAG or TOS on their websites and expect compliance from visitors.

Why is Joe Schmo's website or Bob's Corner Grocery website any different in expecting compliance from visitors, whether spidering bot or human visitor?

If a webmaster has a page or two that are primarily links and images?
They are in a minor league compared to many of us who participate in this forum and many other forums.

If a standard or capability doesn't exist?
Then one surely needs to be created.

Msg #6 of this thread.

Dijkgraaf

4:03 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, my message #6 isn't even about future standards, just the standards I currently expect a bot to meet.
  • Reads robots.txt
  • Obeys Disallow directives
  • Requests robots.txt again if more than 24 hours have passed since the last request
  • Doesn't request robots.txt more than once an hour
  • Doesn't request files more than once every 5 seconds
  • Doesn't repeat a request within 24 hours
  • Obeys the nocache tag
  • Obeys the nofollow tag
  • Obeys the noindex tag
  • Obeys the none tag
  • Request-header User-Agent contains a URL to a page about the bot
  • The about-bot page explains how to disallow the bot in robots.txt
  • Request-header User-Agent contains the bot's name
  • The bot name in the User-Agent matches the one it looks for in robots.txt / meta tags
  • Request-header User-Agent doesn't change often
  • Recognises commented-out links
  • Doesn't request URLs containing #
  • Stops revisiting not-found (404) pages after a time
  • Doesn't frequently re-request 404 pages
  • Stops revisiting moved-permanently (301) pages
  • Stops revisiting gone (410) pages

And also some features that I'd like to see in all bots (some support some, but most don't):

  • Request-header contains If-Modified-Since (including when fetching robots.txt)
  • Obeys Disallow file-extension wildcard directives
  • Allows the site owner to control the delay
  • Allows the site owner to control the frequency of revisits
  • Obeys the no-email-collection tag
  • Request-header From contains an e-mail address
  • Request-header Referer contains where it found the link to the requested item
  • Request-header User-Agent contains a contact e-mail address
  • The about-bot page explains the purpose of the bot
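A scoring system like the one Dijkgraaf describes could be as simple as a weighted checklist. The behaviours and weights below are purely illustrative, not taken from his actual page:

```python
# Toy bot-scoring sketch: award or deduct points per observed behaviour.
# Behaviour names and weights are made up for the example.
RULES = {
    "reads_robots_txt": +10,
    "obeys_disallow": +10,
    "sends_if_modified_since": +5,
    "ua_links_to_bot_page": +5,
    "hammers_404s": -10,
    "ignores_crawl_delay": -15,
}

def score_bot(observed_behaviours):
    """Sum the weights for every behaviour observed in a bot's logs;
    unknown behaviours score zero."""
    return sum(RULES.get(b, 0) for b in observed_behaviours)

print(score_bot(["reads_robots_txt", "obeys_disallow", "hammers_404s"]))  # -> 10
```

A webmaster could then ban any bot whose score drops below some threshold, rather than judging each behaviour in isolation.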
Lord Majestic

4:05 am on Nov 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And why not?

Because the best available computers today don't have the brain capacity of a bee, and won't for another decade or so. What you want is a pipe dream that won't materialise for decades to come, whether you like it or not. It can't happen and won't happen any time soon, so you might as well forget it because it's technically not feasible. If you are a genius who can prove this wrong then please do -- I will be the first to bow, and the achievement will no doubt contribute to computer science in a big way.

Google, Yahoo, MS and ALL the others have either UAG or TOS on their websites and expect compliance from visitors.

Yes, but that's for visitors; bots are controlled via robots.txt. It is the best standard we currently have, and I agree it needs to be extended: the Crawl-Delay parameter, for example, should really be supported by every good bot. Simple pattern matching probably should be made a requirement too. And naturally supporting gzip, and ideally If-Modified-Since/checksums, to reduce bandwidth usage on all sides.

All of this is sensible, but your overly emotional reaction to bots not reading your T&Cs won't help solve anything whatsoever.