Writing a bot for my site to check links.

Is robots.txt validation required?

Duskrider

8:54 am on Jun 20, 2006 (gmt 0)

Hello everyone,

I'm in the process of writing a spider/bot for a website I'm putting up, and I was wondering about the need for it to follow directives in robots.txt.

The spider will check links for a specified domain, simply getting the linked page and withdrawing again... making sure the page exists. It's not even checking the contents.

My question to you is, does it need to get and respond to the robots.txt file for the linked site? My immediate thought would be no... since it's not really doing any spidering or indexing, just checking for valid links. However, I don't want to be rude and assume as much. Making the spider validate and follow a robots.txt file really isn't that difficult, so I would rather write it that way than upset anyone. However, if it isn't needed then I'd rather not waste my time writing the code for it.

Thanks for the input!

jonrichd

11:12 pm on Jun 20, 2006 (gmt 0)

You don't say how this spider will be used -- if it's just for you to use, or if others will be able to use it. My guess would be that most of the folks here would say you should follow the wishes of robots.txt - people have enough problems with rogue spiders using bandwidth.

That being said, if the purpose of this spider is just to check outbound links from a website, returning some sort of an indication that you couldn't spider due to robots.txt would tell you the links you needed to check manually.
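
A minimal sketch of that idea, assuming Python's standard urllib.robotparser module and a made-up user-agent token (ExampleLinkChecker is a placeholder, not anything named in this thread). It parses robots.txt text that has already been downloaded and marks disallowed links for manual checking instead of requesting them:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent token for the link checker; not from the thread.
USER_AGENT = "ExampleLinkChecker"


def load_rules(robots_txt):
    """Parse the text of an already-downloaded robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_status(parser, url):
    """Return 'check' if the URL may be fetched, else 'manual'."""
    if parser.can_fetch(USER_AGENT, url):
        return "check"
    # Disallowed: do not request it, just flag it for manual review.
    return "manual"


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = load_rules(SAMPLE_ROBOTS)
print(check_status(rules, "http://www.example.org/index.html"))
print(check_status(rules, "http://www.example.org/private/page.html"))
```

In a real checker the robots.txt text would be fetched once per host and cached, so each external domain is asked for it only one time per run.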

Duskrider

1:14 am on Jun 21, 2006 (gmt 0)

The implementation is planned to be on a site where users can check the links of their site by giving the spider their URL. It then spiders their site, looking for all links and following links within the specified domain to continue the link check further on the site. It also, of course, checks outbound links by retrieving the linked resource and making sure it exists.
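
A rough sketch of how that split between internal and outbound links might look, using only Python's standard library. The class and function names are illustrative assumptions, and "within the specified domain" is taken here to mean an exact hostname match, which is only one possible reading:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def split_links(page_url, html):
    """Return (internal, external) links found in one page."""
    extractor = LinkExtractor(page_url)
    extractor.feed(html)
    site_host = urlparse(page_url).hostname
    internal, external = [], []
    for link in extractor.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.hostname == site_host:
            internal.append(link)  # candidates for further crawling
        else:
            external.append(link)  # only checked for existence
    return internal, external
```

Internal links would go back into the crawl queue, while external ones get a single existence check and are never followed further.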

I plan on having the spider read and follow robots.txt for the specified domain as an option to the user, simply so they can test if their robots.txt is valid and works. I was hoping I could avoid hitting external domains for two files rather than just the one file I need to check, since I'm not really doing any crawling.

I'm still not 100% sure how I'm going to implement the spider and how it will crawl the user's site, but I know for sure it won't crawl externally.

wilderness

1:22 am on Jun 21, 2006 (gmt 0)

The implementation is planned to be on a site where users can check the links of their site by giving the spider their URL.

Will there be some method of verification on whether "their URL" is actually their URL?

Or may anybody visit the site and insert any URL they please, and a few moments later another URL?

Duskrider

2:03 am on Jun 21, 2006 (gmt 0)

I don't plan on requiring the user to validate that it's actually their URL before running the check... so it's possible they could check several different domains in a short period of time. If they don't validate that the domain is theirs in some way, however, my spider will follow robots.txt on the site they list for sure (as well as robots meta tags).

My plan at this point is to include an option for the user to either allow my spider to ignore their robots.txt for link check purposes or follow robots.txt for testing purposes. If they choose to ignore robots.txt for the site they list, there will have to be some form of proof the site is theirs. This is only something I've come to think about after posting in this thread initially, which is why I'm here rather than blindly coding my spider. :)

wilderness

3:10 am on Jun 21, 2006 (gmt 0)

Many thanks for taking the time and making the effort to receive feedback and answer questions.

bull

5:42 am on Jun 21, 2006 (gmt 0)

It's not even checking the contents.

So only HTTP HEAD requests will be used. I therefore do not see a requirement for robots.txt support, if the link checker clearly identifies itself as such and offers an info URL in the User-agent string. I am using exactly such a thing myself.
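
For illustration, a HEAD request with an identifying User-agent could be built like this with Python's urllib. The bot name and info URL are placeholders, and this is a sketch rather than the checker described above:

```python
import urllib.error
import urllib.request

# Hypothetical identity; the bot name and info URL are placeholders.
USER_AGENT = "ExampleLinkChecker/1.0 (+http://www.example.org/linkchecker.html)"


def build_head_request(url):
    """Build a HEAD request that identifies the checker in User-Agent."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )


def head_status(url, timeout=10.0):
    """Send the HEAD request and return the HTTP status code.

    Network failures (DNS errors, timeouts) are not handled here and
    propagate to the caller.
    """
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still useful.
        return err.code
```

A HEAD response carries the status line and headers but no body, so the target server does not have to send the page itself.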

Jan

incrediBILL

6:50 am on Jun 21, 2006 (gmt 0)

I don't plan on requiring the user to validate that it's actually their URL before running the check

If that's the case, I'd probably block you, sorry, but willy-nilly accesses to my server just get under my skin. Oh who am I kidding, you're already blocked, I whitelist, I don't blacklist ;)

You should really do an email validation of the webmaster on the domain before crawling for a couple of reasons.

a) people with very large sites will be ticked

b) abusive idiots will aim your crawler at something like yahoo.com and giggle until they fall off their chair at your wasting your own bandwidth

Duskrider

8:49 am on Jun 21, 2006 (gmt 0)

Thanks for the info everyone.

incrediBILL - I have given serious thought to the possibility of users putting in ridiculous queries like yahoo.com and ebay.com. I'm not yet at the point in development of my spider where I'm addressing that. I'm trying very hard to make sure I cover everything, so I'm taking my time and double checking every last bit of functionality before I release it anywhere other than domains I own.

I was thinking along the lines of asking the user to put a file in the root of their domain (or the root of the account they wish to check) for my spider to check and validate that it's actually them. I hadn't given much thought to e-mail validation, I'll have to look into it. As for blacklisting me, that's ok, because unless you want to use the service I'm offering I should have no business being there in the first place. :)
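
One way the file-in-the-root check could be sketched, assuming the service derives a per-domain token from a server-side secret and asks the user to publish it in a small text file. The secret, the file name, and all function names below are hypothetical:

```python
import hashlib
import hmac

# Hypothetical server-side secret and file name; neither is from the thread.
SECRET_KEY = b"replace-with-a-long-random-secret"
VERIFY_FILENAME = "linkchecker-verify.txt"


def make_token(domain):
    """Derive the token the site owner must publish for this domain."""
    normalized = domain.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def verification_url(domain):
    """Where the checker expects to find the token file."""
    return "http://{}/{}".format(domain.strip().lower(), VERIFY_FILENAME)


def is_verified(domain, fetched_body):
    """Compare the fetched file contents with the expected token."""
    expected = make_token(domain).encode("utf-8")
    supplied = fetched_body.strip().encode("utf-8")
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(expected, supplied)
```

The user would be shown make_token(domain), upload it at verification_url(domain), and the checker would fetch that one file before agreeing to crawl. Because the token is tied to the domain, a file copied from another site would not pass.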

I'd hate to require that extra step for anyone wanting to use the service, but I may be forced to do so anyway. One of the many things on my list to think about. I appreciate the input.

Matt Probert

12:29 pm on Jun 21, 2006 (gmt 0)

If the spider is to check the links that a human reader may follow, then it should behave like a human reader might, and particularly ignore a robots.txt file.

This can lead to bad feeling among webmasters. To redress the balance, your spider should be very conservative in its requests: if it bangs out a dozen threads at a time you will find it blocked widely, while a single page request every 60 seconds may not even be noticed.
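
As a sketch of that kind of restraint, the class below spaces out requests to the same host. The 60-second default simply mirrors the figure above, and the class name and the injectable clock and sleep functions are assumptions made so the logic can be tested without real waiting:

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Allow at most one request per host every `min_interval` seconds."""

    def __init__(self, min_interval=60.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlparse(url).hostname or ""
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

Calling throttle.wait(url) before each request keeps a single-threaded checker from hitting any one server more than once per interval, while links to different hosts are not delayed by each other.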

Matt

Pfui

6:04 pm on Jun 21, 2006 (gmt 0)

Duskrider, if only all bot-crafters were as polite and interested in their fellow webmasters' opinions as you are! Thank you for asking for info rather than taking us to task over how each of us chooses to stem the tide of too many ill-behaved bots. Alas, not being a bot code pro, all I can offer is this Wish List...

PLEASE --

1.) Only real people in real time. Thus --

2.) robots.txt should be read and complied with by all 'unmanned' agents. There should be no option to override. Ever.

If people want to crawl and/or link-check their site but they have robots.txt files Disallowing same, no prob. They can simply rewrite or re-title their file rather than running rough-shod over mine.

Okay. Moving along to the get-real elements:)

3.) No spoofed UAs; no goofy IDs.

As mentioned in a prior post, truthfully ID your agent in the string. AND require a verified contact address, preferably the bot-runner's, if not your own. Don't let someone run anything with "anonymous" in it, e.g., the too-precious:

"Anonymous/0.0 (Anonymous; [anonymous.com;...]  noreply@anonymous.com)"

Ditto anything akin to the (in)famous --

"larbin2.6.3@unspecified.mail"

-- or the Just Plain Stupid --

"Firefox/1.0.1 (someone@somewhere.any)"

(Aside: Yes, I'm talking about you, .gaoland.net. FYI, you're next on my to-firewall list. Merde.)

-- or our very own incrediBILL's Nemesis of the Month:

"(Nutch; [lucene.apache.org...]  nutch-agent@lucene.apache.org)"

Oh, and don't include A HREFs in the string in hopes you'll stand out in log files and/or get hit on if the site's stats are public (yikes). The fastest way to get me to deep-six you is for you to mess with my logs. E.g. this actual string (name/sitename changed):

"<a href='http://www.example.org'> Example Blah Blah Organization </a> (info@example.org)"

Gimme a break. That's not a string, that's spam.

Lastly, spare us any "What am I?" identity crises:

"WIRE/0.11 (Linux; i686; Robot,Spider,Crawler,aromano@cli.di.unipi.it)"

4.) No lwp-anything. (See #8.)

5.) Unmanned agents should not request .ico files.

I'm currently beating off hundreds of requests for these every day from Google's Desktop and Toolbar users, ditto proxymsn.com Hosts and MSN IPs. The same Hosts and IPs day after day, hitting the exact same .ico two and three times/second.

6.) Unmanned agents should not request .jpg, .gif, .mid, .js, etc.

Don't make me pay for stuff you don't need.

7.) Unmanned agents should report server errors AND heed 403s.

Errors should be BIG, bold, red. Also, Forbidden means, "No." Get a clue. Go away.

8.) Link-checkers should not run more than once a month, tops. If that often.

I don't know how this limit could be doable but it really frosts me when people run link-checkers and then ignore red flags. A school in NYC runs "lwp-request" against our front door twice a day. A public library in Michigan runs "Checkbot" (an LWP variant) against two 9 year-old URLs every single night, and has for approx. seven years. I've long 403'd those UAs but apparently no one bothers to review their checkers' results because they're still at it.

Another person ran a manually entered bookmark-checker every month, and also never checked the results. So every month, like clockwork, errors, errors, errors. I finally redirected the IP to an e-me page, which they did, and I explained the problem. They acknowledged not looking at the results -- for years. Uh, yeah, I could tell.

(What I should do is send all of these guys a bill for monitoring their uptimes.)

Still another library-related services site with multiple sub-domains ran Apache's "Jakarta Commons-HttpClient" against us hundreds of times whenever they did somethingorother. Multiple e-mails were unreplied-to. Even two phone calls were unreturned. So we firewalled 'em.

9.) Link-checkers should not check (let alone link-to in the first place) pages ending in .cgi or .pl, or .shtml, etc.

We have CGIs whose output changes URLs upon archiving. We also have doorway pages when boards close for holidays and such. Despite posted requests to the contrary, people routinely link to the temp posts/pages and checkers mean errors, errors, errors. Again, if someone ever looked at the results, fine, but they don't. I finally end up going in and 302'ing the Hosts/IPs from the directories to get SOMEone's attention. (The programs are already blocked.)

10.) The program CANNOT be run again if there are outstanding errors in the prior results.

Perhaps not so coincidentally, nos. 8 and 9 would be solved by making a checker real-time only. That way people can't schedule it, or cron it. They have to actually eyeball the results and check/fix or remove broken or Forbidden links before running the program again (and again and again and again and again and...).
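
For what it's worth, items 5, 6 and 9 of the list could be approximated with a simple extension filter like the one below. This is only a sketch: the set holds just the extensions actually named above (the "etc." is not expanded), and the function name is made up:

```python
import posixpath
from urllib.parse import urlparse

# Extensions named in wish-list items 5, 6 and 9; the "etc." in those
# items is deliberately not expanded here.
SKIPPED_EXTENSIONS = {".ico", ".jpg", ".gif", ".mid", ".js", ".cgi", ".pl", ".shtml"}


def should_request(url):
    """Return False for URLs whose path ends in a skipped extension."""
    path = urlparse(url).path  # the query string is not part of the path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS
```

A checker that skips these URLs cannot vouch for them, so they would need to show up in the report as "not checked" rather than as working links.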

Well. That's all for now. Hope you're not (too) sorry you asked!

Basically when someone can do something without thinking or use any program so they don't have to think, all too often that means more work for me. So if you can code things such that people have to actually engage their brains, please do -- and then Sticky me with how I can invest in your genius:)

jdMorgan

12:54 am on Jun 22, 2006 (gmt 0)

Duskrider,

As you can see from the various responses, some Webmasters are laissez-faire, and others are adamant about blocking unknown or potentially-abusive user-agents.

My wish list echoes the others in most aspects:

Use the HTTP HEAD request, not a GET.

Provide a meaningful user-agent string with a link to a page explaining what it is used for.

Handle 403-Forbidden error responses carefully; such a response may mean that your user-agent is blocked, but does not mean that the link is actually broken if visited by a browser.

Limit the rate at which link-targeted pages on a single site can be checked. IOW, if the site being checked links to 150 of my pages, please don't request them all at once. If you hit a dynamic site on a small/slow shared server at that rate, you could crash it. Do that enough times, and you could easily end up on a lot of public 'permanently-ban' lists.

One thing you could do to placate Webmasters whose pages are being checked as link-targets would be to provide that other site's linking URL in the HTTP_REFERER header. This would allow any Webmaster whose site was 'victimized' by someone using your tool maliciously to at least report the site that was being checked by your spider. That, combined with some sort of user-validation on your site, would help stem abuse.
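
A small sketch of two of the points above, the Referer header and the careful handling of 403, using Python's urllib. The user-agent string is a placeholder and the report categories are one arbitrary choice, not a standard:

```python
import urllib.request

# Hypothetical identity string; placeholder values only.
USER_AGENT = "ExampleLinkChecker/1.0 (+http://www.example.org/linkchecker.html)"


def build_check_request(target_url, linking_page):
    """HEAD request that names the page containing the link as the referrer."""
    return urllib.request.Request(
        target_url,
        method="HEAD",
        headers={"User-Agent": USER_AGENT, "Referer": linking_page},
    )


def classify_status(status):
    """Map an HTTP status code to a link-report category."""
    if 200 <= status < 400:
        return "ok"
    if status == 403:
        # The checker may simply be blocked; a browser might still get in.
        return "blocked: verify manually"
    if status in (404, 410):
        return "broken"
    return "error: retry later or verify manually"
```

With the linking page in the Referer header, a webmaster who sees the checker in their logs can tell which site was being checked, and a 403 ends up in the report as something to look at by hand rather than as a dead link.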

There is a lot of abuse on the Web, so you should view user-authentication and 'polite' accessing of target pages not as a courtesy to other Webmasters, but as a survival technique for your own tool. Larbin, LWP, Nutch, and several other very useful projects naively ignored abuse-prevention, and as a result, are now banned by many servers. Assume that your service *will* be used to abuse other sites, and work from there, rather than making the mistakes these others made.

Thanks for asking,
Jim

Duskrider

1:27 am on Jun 22, 2006 (gmt 0)

Wow, some really great responses here, thanks!

As I'm writing my code I'll have to keep all of this in mind. Many of the points that were brought up, while valid, won't apply to my spider simply because it's a real time crawler and doesn't do anything via cron or perpetual motion. The only time it will crawl a site is when it's specifically asked by a real person who's typing that address into a form request.

That said, I'm still thinking about user validation. That should negate most chances of my spider being used maliciously, though I won't ignore the possibility even with validation.

At this point I have my UA string set to something like
"blahBot 1.0 - blahblah.com (http://www.blahblah.com/blahbot)" Blah, of course, is just for example since we can't be specific. I'll add a contact e-mail to that next time I sit down and start programming.

Thanks again for all the info. You've given me a lot to think about, and I'll make sure to reference this thread while I'm working on my Bot!

-DR-

jdMorgan

2:07 am on Jun 22, 2006 (gmt 0)

I posted this in another thread [webmasterworld.com] just now, and it may be helpful to you as well: Standard User-agent strings [mozilla.org]

Actually, that whole thread might be useful to you as an example of what can happen, and how fast.

Jim

thetrasher

10:54 am on Jun 22, 2006 (gmt 0)

The only time it will crawl a site is when it's specifically asked by a real person who's typing that address into a form request.

How do you know that it is a real person filling in that form? Especially on the web, it is sometimes hard to recognize whether it is a human or a machine.

Duskrider

5:39 am on Jun 23, 2006 (gmt 0)

Compared to everything else, that's the easy part.

I'll just use a CAPTCHA script to display text that humans can read but bots can't. In order to use the service, you'll have to enter the text in a CAPTCHA graphic correctly.

Mr Bo Jangles

6:27 am on Jun 23, 2006 (gmt 0)

This would be a really useful little project for Mr Google to throw a few $s at - I would hazard a guess, that more people might use this than their new web spreadsheet!
A new Google service that webmasters can sign up for (free of course) that will check all internal and external links on their web site and send them an e-mail reporting that one or more are broken - you then click on a link in the e-mail to be taken to your broken link report.

Sign me up!

Pfui

6:52 am on Jun 23, 2006 (gmt 0)

Instead of the CAPTCHA stuff (which, to be really good for all, needs to be really readable by all, please -- those 1s and ls and Is always trip me up:), how about having potential -- or paid -- users e-mail you with their info, site name and URL, etc.? That way, you not only confirm they're real, but also that the site they're wanting to link-check is theirs.

Why the extra step?

Well, for years, a tool-type site has offered free link-checking to anyone who types in any URL. (Few things are more obnoxious, and bizarre, and troublesome, than seeing someone run your URL through any tool site.)

I just revisited that same site and now they offer link-checking, spell-checking, assorted compatibility tests and more. No restrictions, no impediments, just basically a fill-in-the-URL form to see what I'd call 'free samples' of their more extensive, paid-for services.

So I typed in a site's page I know will rewrite to: [127.0.0.1...]

"This report shows possible spelling errors on your page."
"Warning: tag <script> missing required attribute"
"attribute "valign" has invalid value "center"
"Total Incompatibilities: 31"

All results, of course, from their own page. Apparently they don't use their own tool: )

Point is, I think tools sites (& programs) are too loose when it comes to letting anybody run anything against any URL. (Last year, and of his own accord, a pro I know pointed a monitoring program toward our network and it blew errors everywhere. I'm still irked with him!) Anyway, if people at least have to e-mail you with 'their' info, with their e-mail being their acceptance of your TOS, then you know who they are, and where, if they abuse a site using your tool.