Details:
2003-02-24 11:59:38 194.74.151.201 - W3SVC48 RASRV02 nnn.nnn.nnn.nn 80 GET /robots.txt - 200 0 450 130 31 HTTP/1.1 YellSpider - -
That IP resolves to:
inetnum: 194.74.151.192 - 194.74.151.207
netname: BT-CUST-983
descr: Yellow Pages
country: GB
admin-c: WG219-RIPE
tech-c: SW239-RIPE
status: ASSIGNED PA
mnt-by: RIPE-NCC-NONE-MNT
changed: Peter.Lee@bt.net 19961217
source: RIPE
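For anyone decoding that log line: it's IIS W3C extended log format. The actual #Fields directive isn't shown in the excerpt, so the field order below is an assumption based on a typical IIS layout that happens to match this entry. A minimal Python sketch:

# Hedged sketch: field order assumed from a typical IIS W3C #Fields line,
# since the real directive isn't included in the excerpt above.
line = ("2003-02-24 11:59:38 194.74.151.201 - W3SVC48 RASRV02 "
        "nnn.nnn.nnn.nn 80 GET /robots.txt - 200 0 450 130 31 "
        "HTTP/1.1 YellSpider - -")

fields = ["date", "time", "c-ip", "cs-username", "s-sitename",
          "s-computername", "s-ip", "s-port", "cs-method", "cs-uri-stem",
          "cs-uri-query", "sc-status", "sc-win32-status", "sc-bytes",
          "cs-bytes", "time-taken", "cs-version", "cs(User-Agent)",
          "cs(Cookie)", "cs(Referer)"]

entry = dict(zip(fields, line.split()))
print(entry["cs(User-Agent)"], entry["cs-uri-stem"], entry["sc-status"])
# -> YellSpider /robots.txt 200

In other words: the YellSpider user agent requested /robots.txt and got a 200 back.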
The robots.txt file (if it exists) tells every spider which parts of the site can or can't be spidered.
A well-behaved spider will automatically look for a robots.txt file before it proceeds any further.
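As an illustration, here's a minimal sketch of what a well-behaved spider does before crawling, using Python's standard urllib.robotparser. The site URLs are placeholders; the user agent string is taken from the log line above:

import urllib.robotparser

# A polite spider fetches and parses robots.txt before crawling anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # downloads and parses the file

# Only fetch a page if the rules allow it for this user agent.
if rp.can_fetch("YellSpider", "http://www.example.com/some/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")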
My point is that they may be spidering all of the content from each of their customers' sites. I've just finished a project for Thomson Local to do exactly this.
You're right - if you wanted to spider the whole site, you would check robots.txt first. But if you were simply Yell checking a single URL held against your customer records, why bother? Surely you would just test that the single URL was valid?
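For comparison, a sketch of that simpler check: one HEAD request against a stored URL, with no robots.txt lookup involved. The URL is a placeholder:

import urllib.error
import urllib.request

# Validate a single customer URL without crawling anything else.
def url_is_valid(url: str) -> bool:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            # urlopen follows redirects; any 2xx/3xx final status counts as valid.
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError):
        # 4xx/5xx responses raise HTTPError (a URLError subclass),
        # so dead links fall through to False here.
        return False

print(url_is_valid("http://www.example.com/"))

That would explain a bot that only ever hits one page per site - but the log shows it asking for robots.txt, which fits the whole-site-spidering theory better.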