Very useful forum! The robots.txt file of a large web site I'm looking at is:
User-agent: *
Disallow: /admin
Disallow: /empdir
Disallow: /jobman
Disallow: /jobsearch
Disallow: /reports
Disallow: /talentmatch
The few pages I intend to parse (/seeker.epl) are at the root level, so they seem to be allowed, right? The HTML files have no META tags for "robots". But here is what I found on their Terms and Conditions page:
While using the Site or Site-related services, you agree not to do any of the following without our prior written authorization:
...
Use any search engine, software, tool, agent or other device or mechanism, including without limitation browsers, spiders, robots, avatars or intelligent agents (other than those made available by the Site or other generally available third party web browsers, e.g., Netscape Navigator or Microsoft Internet Explorer), to navigate or search the Site.
It is not my intention to do anything illegal with my spider or to abuse their bandwidth. After all, the already public links I intend to collect will send people to their web site for the actual content.
Question is: Which restrictions should prevail: those from their Terms page or from robots.txt? Can I use the spider according to robots.txt and simply ignore the rest?
Thanks so much,
Christian
without our prior written authorization
That said, there is no way Googlebot has time to read terms of use! That is exactly what robots.txt is designed for. So if they moan at you, they should also moan at Google... then again, they probably don't want to do that, which is their right.
Anyway, they know how to use robots.txt, so if they want to ban your user agent then (presumably) you will let them, by recognizing a command for your bot should one arrive. But getting an OK from them in an email would be the best solution for you, if you can.
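For what it's worth, here is a rough sketch of how a spider could check those rules before fetching anything, using Python's standard urllib.robotparser. The robots.txt text is the one quoted at the top of the thread; the bot name "ExampleSpider" and the extra ban block are made up for illustration and are not something the site actually publishes.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the original post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /empdir
Disallow: /jobman
Disallow: /jobsearch
Disallow: /reports
Disallow: /talentmatch
"""

# Hypothetical user-agent name for the spider (an assumption, not from the thread).
BOT = "ExampleSpider"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# /seeker.epl matches none of the Disallow prefixes, so it is allowed.
seeker_allowed = parser.can_fetch(BOT, "/seeker.epl")

# Anything under /admin is disallowed for every user agent.
admin_allowed = parser.can_fetch(BOT, "/admin/login")

# Hypothetical: the site later adds a block that names the bot directly.
# A well-behaved spider should then stop fetching altogether.
banned = build_parser("User-agent: ExampleSpider\nDisallow: /\n\n" + ROBOTS_TXT)
seeker_after_ban = banned.can_fetch(BOT, "/seeker.epl")

print(seeker_allowed, admin_allowed, seeker_after_ban)
```

This only answers the robots.txt side of the question. It says nothing about whether the Terms and Conditions allow the crawl.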
I'm afraid asking them for permission wouldn't be advisable in this case. I'm talking about a company with revenues of $1M a day and huge traffic, not about a guy whose site's bandwidth can be affected by my robot! And I wouldn't want to give them ideas about what I do. Anyway, my indexing engine would also use other sites.
As said before, no robot looks for and reads the terms on sites and, as long as you obey robots.txt and access is not denied, nobody can complain.
-------------------
I also have another concern. It appears most of you guys have your own site(s) and do not usually welcome robots (Googlebot is a lucky exception ;-)). I understand your attitude: I have my own small non-profit site, and it bothers me when I see robots opening hundreds of connections and picking up email addresses for spam.
But, with so much info on the Internet, there is a huge need for friendly robots to index specific data and present it in a more intelligent way, and to create new value for the Internet user. I truly think site owners should not deny access to any kind of robot just because they can (actually, they can only say in robots.txt when spiders are not welcome). There are no laws at this time for these issues, but hopefully we'll see some in the near future.
Cheers,
Cristian