IP: 128.194.135.xx (2005-10-21, 2005-10-28)
IP: 35.9.45.xx (2005-10-28, 2005-10-31)
It read robots.txt only on 2005-10-21; after that it fetched only the index page, again and again. Apparently at "Texas A&M University" / "Michigan State University" (?!) they don't know about If-Modified-Since...
they don't know about If-Modified-Since...
Thanks for the heads-up about this new version.
And also some features that I'd like to see in all bots (some support some, but most don't):
1. Read robots.txt before each crawl.
2. Do not read files/folders that are disallowed in robots.txt.
3. Do not take more than one page every 2-3 seconds.
4. Make better use of If-Modified-Since.
Even then, some of us need to put a limit on the number of bots that crawl our sites. In my case I can only spare the resources for bots that send me traffic. If your bot constantly crawls my site and never sends me any traffic in return, it won't be long before the bot is disallowed or forcibly banned.
Am I missing anything?
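For what it's worth, points 1-4 above don't take much to implement. Here is a minimal sketch using Python's standard library; the bot name and site URL are made up for illustration, and a real crawler would need proper error handling and per-host state:

import time
import urllib.request
import urllib.robotparser
from urllib.error import HTTPError

USER_AGENT = "ExampleBot/1.0"    # hypothetical bot name
SITE = "http://www.example.com"  # hypothetical site

# 1. Read robots.txt before each crawl.
rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

last_modified = {}  # url -> Last-Modified header seen on the previous fetch

def polite_fetch(url):
    # 2. Never touch anything disallowed in robots.txt.
    if not rp.can_fetch(USER_AGENT, url):
        return None
    headers = {"User-Agent": USER_AGENT}
    # 4. Send If-Modified-Since once we have seen the page before.
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            if "Last-Modified" in resp.headers:
                last_modified[url] = resp.headers["Last-Modified"]
            body = resp.read()
    except HTTPError as err:
        if err.code == 304:  # Not Modified: skip the re-download
            return None
        raise
    time.sleep(3)  # 3. At most one page every 2-3 seconds.
    return body

The 2-3 second gap alone would stop most of the index-page hammering complained about at the top of the thread.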
All bots need to do is comply with either UAG or TOS of a website
LOL, you can't be serious! Hell, it would take a qualified lawyer some time to understand some ToS, let alone a machine! Bots can't understand human language and won't for a long time. Whether you like it or not, the best you can do is use an agreed API like robots.txt to tell bots to stay away from your site.
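To illustrate, a few lines of robots.txt are enough to shut a compliant bot out of a site entirely, no lawyer required (the bot name here is made up):

User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/

Any bot that honours the standard will read that and never request another page.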
I refrained from adding drivel to a thread of yours when the door was left wide open for bashing.
Why on earth would you take something out of context and then feel the need to supply an answer for the omitted material?
I submitted
All bots need to do is comply with either UAG or TOS of a website, which they have neither an interest in nor the ability to read.
you quoted
All bots need to do is comply with either UAG or TOS of a website
All bots need to do is comply with either UAG or TOS of a website, which they have neither an interest in nor the ability to read.
How can a bot understand a TOS like, for example, the one here: [webmasterworld.com...]
Do you have code or some good methodology for implementing a good understanding of human writing in multiple languages?
P.S. If you want to add your opinion to my thread then please do - I posted here for just one reason: to get feedback.
How can a bot understand a TOS like, for example, the one here: [webmasterworld.com...]
Do you have code or some good methodology for implementing a good understanding of human writing in multiple languages?
majestic,
When we as webmasters report an infraction (a violation of their own UAG/TOS, or of our own UAG/TOS) to either a bot operator or an internet provider?
We are compelled to jump through hoops while standing on our heads with balls of fire coming out of our backsides!
And for all this, in most instances we are favored with an automated reply informing us that we must "jump through hoops while standing on our heads with balls of fire coming out of our backsides."
Why should a bot, spidering our material on our web pages and absorbing our bandwidth, be expected to comply with any lesser standard?
I do agree and understand the translation issue between languages.
It's the primary reason that the majority of RIPE users are denied access to my sites. The ability does not exist for me to determine if my materials have been duplicated.
The translation tools are just not functional.
Additionally, if you've ever been fortunate enough to read something that has been translated from one language into 3-4 others and then back into the original language?
The end result is something that is not in any way similar to the original text.
Why should a bot, spidering our material on our web pages and absorbing our bandwidth, be expected to comply with any lesser standard?
I think there is some confusion here: which standard do you refer to? If it's robots.txt then I am in complete agreement that bots should follow it, but if it's something else then I am not sure I understand what you're referring to, as you can't possibly be expecting a machine to understand millions of different Terms and Conditions pages that are often written by lawyers for lawyers, not humans.
but if it's something else then I am not sure I understand what you're referring to, as you can't possibly be expecting a machine to understand millions of different Terms and Conditions pages that are often written by lawyers for lawyers, not humans.
And why not?
Users are expected to comply with software UAGs when more than 90% don't even read them, and the majority of those who do haven't a clue what they are reading.
Google, Yahoo, MS and ALL the others have either UAG or TOS on their websites and expect compliance from visitors.
Why is Jo Schmo's website or Bob's Corner Grocery website any different in expecting compliance from visitors, whether spidering bot or human visitor?
If a webmaster has a page or two which is primarily links and images?
They are in a minor league as compared to many of us that participate in this forum and many other forums.
If a standard or capability doesn't exist?
Then one surely needs to be created.
Msg #6 of this thread.
And also some features that I'd like to see in all bots (some support some, but most don't)
And why not?
Because the best available computers today do not have the brain capacity of a bee, and won't for another decade or so. What you want is a pipe dream that won't materialise for decades to come, whether you like it or not. It can't happen and won't happen any time soon, so you might as well forget it: it's technically not feasible. If you are a genius who can prove this wrong then please do it - I will be the first to bow, and this achievement will no doubt contribute to computer science in a big way.
Google, Yahoo, MS and ALL the others have either UAG or TOS on their websites and expect compliance from visitors.
Yes, but those apply to human visitors; bots are controlled via robots.txt. It is the best standard we currently have, and I agree it needs to be extended: the Crawl-Delay parameter, for example, should really be supported by every good bot, and simple pattern matching should probably be made a requirement. And naturally bots should support gzip, and ideally If-Modified-Since/checksums, to reduce bandwidth usage on all sides.
All of this is sensible, but your overly emotional reaction to bots not reading your T&Cs won't help solve anything whatsoever.
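As a rough sketch of the Crawl-Delay and gzip points, assuming a hypothetical bot name and site, here is how it might look with Python's standard library (whose stock robots.txt parser reads Crawl-Delay but only does simple prefix matching, so wildcard pattern support would have to be bolted on separately):

import gzip
import io
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleBot/1.0"  # hypothetical bot name

rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
rp.read()

# Honour the site's Crawl-Delay if it sets one; otherwise fall back
# to a conservative default.
delay = rp.crawl_delay(USER_AGENT) or 10

for url in ("http://www.example.com/", "http://www.example.com/news.html"):
    if not rp.can_fetch(USER_AGENT, url):
        continue
    # Advertise gzip support so the server can compress the response.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    time.sleep(delay)  # one request per Crawl-Delay seconds

If-Modified-Since/checksums would bolt onto the same request object, as in the sketch earlier in the thread.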