What sort of information? Anything they can find! The spiders collect information without ascribing any meaning to it. If they find a link, they follow it. They download all of the 'allowed' pages of your site for later analysis. If you are marketing, it's a good thing if they take all of your pages and list them in search results. If you are providing an online service, it may not be, since you probably don't want every one of those pages showing up in search results.
If one of your users has placed a link to the webmail directory somewhere on the 'net that a spider finds, it will crawl into the webmail directory. One link is all it takes.
As to why some spiders and not others, who can tell? Some spiders are more aggressive because their owners provide them with more bandwidth, processing power, and disk space so that they can dig deeper and retrieve more Web pages.
If your users' accounts are password-protected, and there are no "back-door" entries to bypass the password authorization, then you should be OK as far as keeping their "personal" information safe goes.
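If you're running Apache, a minimal Basic Auth sketch for the sensitive directory might look like the following (the realm name and the .htpasswd path are placeholders; adjust for your setup):

    # .htaccess in the directory you want to protect
    AuthType Basic
    AuthName "Webmail"
    AuthUserFile /path/to/.htpasswd
    Require valid-user

Any request into that directory then gets challenged for credentials before the server hands back a page, spiders included.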
I'd strongly suggest you put up a robots.txt that disallows robots from your 'sensitive' areas, though. Being in control of the spiders, instead of at their mercy, is a good thing. robots.txt will keep the well-behaved spiders out; the bad spiders that don't read or obey robots.txt have to be blocked by other means (e.g. ISAPI filters on MS servers, mod_rewrite on Apache). (You define 'sensitive' - it varies from site to site.)
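A minimal robots.txt along those lines, assuming /webmail/ is the directory you want kept out of the indexes (substitute your own paths):

    # robots.txt at the site root
    User-agent: *
    Disallow: /webmail/

And a rough mod_rewrite sketch for the spiders that ignore it - 'BadBot' here is just a placeholder for the offending user-agent string:

    # .htaccess or server config; requires mod_rewrite
    RewriteEngine On
    # Match the bad spider's user-agent, case-insensitively
    RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
    # Refuse every request it makes with a 403 Forbidden
    RewriteRule .* - [F]

One caveat: robots.txt is advisory only, and anyone can read it, so it also advertises the paths you list there. For anything truly sensitive, rely on the password protection, not on robots.txt.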
Out of your list of user-agents, the only one I'd allow without a lot more investigation would be GoogleBot.