Forum Moderators: goodroi
By the way...
I thought your name/inquiry looked familiar. You're with the Web Mining Project Blog [sjb659.blogspot.com]. Your bot came around my site(s) x2 in February and, thankfully, respected robots.txt and included info in its (albeit very, very long:) ID string.
Alas, your March 2 entry indicates "Jobo" will not be respectful in the future? Emphasis mine --
"Customizations include...
"Crawl the URLs in the robots.txt. This would violate the robot exclusion standards but our goal is to collect statistics and analyze the same."
Please know that I, for one, will be most unhappy if Jobo behaves that way. You'd not only be violating robots.txt but my sites' Terms of Use, too.
So if you still plan on crawling specifically Disallowed URLs -- or any, actually, in the case of a site-wide Disallow -- could you please reply with your crawler's ID and/or IP info? The first time around, it came from:
149-159-3-192.dhcp-bl.indiana.edu
Thank you!
2) Indeed, my partner and I are working on the project you indicated below, and these efforts are towards that. We are graduate students of computer science at Indiana University.
3) As part of our project, we do intend to crawl the URLs in the robots.txt, but we do not save the file -- we just get its size. It would basically be a one-time crawl, until we have a complete set of data for two levels of the web.
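For what it's worth, measuring the size of a robots.txt file does not require crawling the URLs it Disallows: fetching robots.txt itself is permitted by the Robot Exclusion Standard, and its size can be taken from the file body alone. A minimal sketch using Python's standard `urllib.robotparser` (the "Jobo" user-agent name and the sample rules here are just illustrative):

```python
# Sketch: get the size of a robots.txt body and still honor its rules.
# The sample robots.txt content and the "Jobo" UA are assumptions.
import urllib.robotparser

ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

def robots_size_and_rules(text):
    """Return the byte size of a robots.txt body plus a parser a
    compliant crawler can consult before every fetch."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return len(text.encode("utf-8")), parser

size, rules = robots_size_and_rules(ROBOTS_TXT)
print(size)                                       # byte count of the file
print(rules.can_fetch("Jobo", "/private/page"))   # False -- stay out
print(rules.can_fetch("Jobo", "/public/page"))    # True
```

In other words, the statistic Smitha describes (file size) is obtainable without ever requesting a Disallowed URL.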
4) We have dynamic IP addresses and hence cannot specify a particular IP address. However, you could contact me at sajay@indiana.edu if there are any other specific queries you wish to ask.
Apologies for the inconvenience.
Smitha
Note that for most sites -- except possibly the ones belonging to the ten pizza shops closest to Indiana U. -- it's no problem for most of us to block the University's entire IP range. And the intensive traffic a project like yours would generate would be most unwelcome at most proxy servers, so that's not a good work-around either...
Nothing personal, but you just flat don't violate robots.txt for any reason, "research" notwithstanding -- any more than you might shoot all your neighbors' dogs "for research." Tell it to the judge.
So, I don't know what statistical effect all these 403 responses will have on your data, but if you won't follow the Standard, don't expect people to tolerate your robot.
As to changing the IP address and/or User-agent name, that makes it a little more difficult to block your robot, but far from impossible. Bots can be blocked behaviourally, as well as by these simple methods. And several of the scripts to do this are posted on this site.
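The kinds of checks Jim alludes to are simple to sketch. Here is a hypothetical illustration of blocking by User-agent substring, by IP prefix, and behaviourally by request rate -- the "jobo" name is taken from the thread, while the IP prefix (from the hostname quoted above) and the rate threshold are invented for illustration:

```python
# Hypothetical blocking sketch. The UA substring, IP prefix, and
# 60-requests/minute threshold are all assumptions, not a real config.
BLOCKED_AGENT_SUBSTRINGS = ["jobo"]     # block by bot name
BLOCKED_IP_PREFIXES = ["149.159."]      # block by IP range (Indiana U. host above)

def should_block(ip, user_agent, requests_last_minute):
    """Decide whether a request should get a 403 response."""
    ua = user_agent.lower()
    if any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS):
        return True                     # matched a blocked User-agent
    if any(ip.startswith(p) for p in BLOCKED_IP_PREFIXES):
        return True                     # matched a blocked IP range
    return requests_last_minute > 60    # behavioural: too many requests

print(should_block("149.159.3.192", "Mozilla/5.0", 3))   # True (IP range)
print(should_block("10.0.0.1", "Jobo/1.0 crawler", 1))   # True (UA name)
print(should_block("10.0.0.1", "Mozilla/5.0", 200))      # True (rate)
print(should_block("10.0.0.1", "Mozilla/5.0", 3))        # False
```

The point of the third check is exactly Jim's: renaming the bot or rotating IPs doesn't help, because its crawling behaviour still gives it away.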
I suggest you discuss this approach with your professor -- or maybe I will. I have done so several times with other ill-advised "uni projects."
You may also wish to consider the legal ramifications of this plan. Make sure you stay far away from .gov and .mil sites, and corporate sites backed by large legal departments.
As Webmasters, we welcome you as part of the community. But don't break the laws of the community if you wish to remain welcome in it.
Google did a robots.txt validation project of this type several years ago. Their results may still be available if you ask nicely. You might even get some implementation advice or code from them.
Jim