Forum Moderators: goodroi


Validating robots.txt

Research on valid robots.txt and usage


smitha_ajay

6:30 pm on Mar 13, 2006 (gmt 0)

10+ Year Member



We are doing a research project to study the usage of robots.txt and what percentage of the web it covers.

As part of this, we are checking each robots.txt file we collect to determine whether it is valid or not.

Is there any Java code that we can reuse in this project for validating these files?

Thanks
Smitha

Pfui

7:49 pm on Mar 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure I understand your question -- you want Java code to analyze and/or validate others' robots.txt files?

By the way...

I thought your name/inquiry looked familiar. You're with the Web Mining Project Blog [sjb659.blogspot.com]. Your bot came around my site(s) x2 in February and, thankfully, respected robots.txt and included info in its (albeit very, very long:) ID string.

Alas, your March 2 entry indicates "Jobo" will not be respectful in the future? Emphasis mine --

"Customizations include...

"Crawl the URLs in the robots.txt. This would violate the robot exclusion standards but our goal is to collect statistics and analyze the same."

Please know that I, for one, will be most unhappy if Jobo behaves that way. You'd not only be violating robots.txt but my sites' Terms of Use, too.

So if you still plan on crawling specifically Disallowed URLs -- or any, actually, in the case of a site-wide Disallow -- could you please reply with your crawler's ID and/or IP info? The first time around, it came from:

149-159-3-192.dhcp-bl.indiana.edu

Thank you!

smitha_ajay

4:29 am on Mar 14, 2006 (gmt 0)

10+ Year Member



1) I need Java code to check whether the robots.txt files collected by crawling various sites conform to the robots exclusion standard (a sketch follows below). For instance, I found that some robots.txt files contain prose like 'Please do not crawl our site'.
So, we would like to measure the percentage of robots.txt files that have valid content.

2) Indeed, my partner and I are working on the project you have indicated above, and these efforts are towards that. We are graduate students in computer science at Indiana University.

3) As part of our project, we do intend to crawl the URLs in the robots.txt, but we do not save the files; we just get their size (also sketched below). It would basically be a one-time crawl, until we get a complete set of data for two levels of the web.

4) We have dynamic IP addresses and hence cannot specify a particular IP address. However, you could contact me at sajay@indiana.edu if you have any other specific queries.
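
On point 1, here is a minimal sketch of the kind of check we have in mind, assuming "valid" means every non-blank, non-comment line is a recognized "Field: value" directive (the class and method names are ours, not from any library):

```java
import java.util.List;
import java.util.Set;

public class RobotsTxtValidator {

    // Fields from the 1994 robot exclusion standard plus common extensions.
    private static final Set<String> KNOWN_FIELDS = Set.of(
            "user-agent", "disallow", "allow", "crawl-delay", "sitemap");

    /** Returns true if every meaningful line parses as a known directive. */
    public static boolean isValid(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments are always permitted
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                return false; // prose such as "Please do not crawl our site"
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            if (!KNOWN_FIELDS.contains(field)) {
                return false; // unrecognized directive name
            }
        }
        return true;
    }
}
```

Feeding each collected file through isValid (e.g. via Files.readAllLines) and counting the results would give the percentage we are after.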
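And on point 3, a minimal sketch of one way to get only the size, assuming a HEAD request and reading Content-Length (the User-agent string is illustrative):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SizeProbe {

    /** Returns the Content-Length the server reports for the URL, or -1 if unknown. */
    public static long contentLength(String url) throws Exception {
        // HEAD asks for headers only, so the response body is never downloaded.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setRequestProperty("User-Agent", "Jobo research crawler"); // illustrative ID string
        try {
            return conn.getContentLengthLong();
        } finally {
            conn.disconnect();
        }
    }
}
```

Note that this still sends a request for the URL, which, as the next reply points out, is the very act the Standard prohibits for Disallowed paths.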

Apologies for the inconvenience.

Smitha

jdMorgan

5:18 am on Mar 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like a good way to ruin your project if I understand what you're planning -- Here's what happens:

  • Your robot fetches robots.txt.
  • It then violates the Standard for Robot Exclusion by fetching files specifically disallowed in robots.txt. (N.B.: The Standard says nothing about whether you store the results; it is a violation simply to fetch Disallowed files. A sketch of the pre-fetch check a compliant crawler makes follows this list.)
  • On my sites, and those belonging to clients, this triggers an immediate and automatic block by IP address, and under certain circumstances, by User-agent.
  • So all the content you fetch from my sites reads something like, "403-Forbidden - Access denied. Click here for more information."
  • Checking my site reports, I then log into WebmasterWorld and report your robot to others here. Those who run scripts similar to mine post in response, confirming the aberrant behaviour of your robot.
  • Others who don't run access-control scripts see the posts, and simply add the info we report to their manual block lists.
  • The word spreads on other Webmaster forums, Web security sites, blogs, and internet block list sites.
  • Your robot is toast, or at least the results become badly-skewed over time... Hours and days, not weeks.
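
For reference, the pre-fetch check mentioned above is simple prefix matching: under the 1994 standard a compliant robot skips any URL whose path starts with a Disallow value in the record that applies to it. A minimal sketch (names are illustrative):

```java
import java.util.List;

public class ExclusionCheck {

    /** True if the path may be fetched, i.e. it matches no Disallow prefix. */
    public static boolean mayFetch(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (prefix.isEmpty()) {
                continue; // "Disallow:" with an empty value excludes nothing
            }
            if (path.startsWith(prefix)) {
                return false; // explicitly excluded; a compliant robot skips it
            }
        }
        return true;
    }
}
```

Here disallowPrefixes would be the Disallow values from the record whose User-agent line matches the robot's name, falling back to the "*" record.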

Note that for most sites, except possibly the ones belonging to the ten pizza shops closest to Indiana U., it's no problem for most of us to block the University's entire IP range. And intensive traffic such as your project would generate would be most unwelcome at most proxy servers, so that's not a good work-around...

Nothing personal, but you just flat don't violate robots.txt for any reason, "research" notwithstanding -- any more than you might shoot all your neighbors' dogs "for research" -- Tell it to the judge.

So, I don't know what statistical effect all these 403 responses will have on your data, but if you won't follow the Standard, don't expect people to tolerate your robot.

As to changing the IP address and/or User-agent name, that makes it a little more difficult to block your robot, but far from impossible. Bots can be blocked behaviourally, as well as by these simple methods. And several of the scripts to do this are posted on this site.
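
To illustrate the behavioural approach, here is a minimal sketch of one such rule, assumed for illustration rather than taken from any script posted here: a client that fetches robots.txt and then requests a Disallowed path gets its address blocked from then on.

```java
import java.util.HashSet;
import java.util.Set;

public class BehaviouralBlocker {

    private final Set<String> sawRobotsTxt = new HashSet<>();
    private final Set<String> blocked = new HashSet<>();

    /** Call once per request; returns true if the request should receive a 403. */
    public boolean shouldBlock(String ip, String path, Set<String> disallowedPrefixes) {
        if (blocked.contains(ip)) {
            return true; // already on the block list
        }
        if (path.equals("/robots.txt")) {
            sawRobotsTxt.add(ip); // this client has read the rules
            return false;
        }
        if (sawRobotsTxt.contains(ip)) {
            for (String prefix : disallowedPrefixes) {
                if (!prefix.isEmpty() && path.startsWith(prefix)) {
                    blocked.add(ip); // read the rules, then broke them: block
                    return true;
                }
            }
        }
        return false;
    }
}
```

Real access-control scripts combine several such signals (request rate, User-agent changes, trap URLs), which is why renaming the robot doesn't help much.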

I suggest you discuss this approach with your professor -- or maybe I will; I've done so several times with other ill-advised "uni projects."

You may also wish to consider the legal ramifications of this plan. Make sure you stay far away from .gov and .mil sites, and corporate sites backed by large legal departments.

As Webmasters, we welcome you as part of the community. But don't break the laws of the community if you wish to remain welcome in it.

Google did a robots.txt validation project of this type several years ago. Their results may still be available if you ask nicely. You might even get some implementation advice or code from them.

Jim
