SteveWh - 2:20 am on Jun 26, 2010 (gmt 0)
Some comments... some of it might be useful or interesting:
No-one has ruled out a forged IP
If someone sends a request with a forged IP address, they won't get a reply from the server, because the reply goes to the address they provided - the forged one. To serve any useful purpose, a forged-IP request sent against your site would *have* to carry the attack payload itself (it's no use at all for site probing), because the sender knows they'll get nothing back from an HTTP request that gives a forged return address. That attack scenario doesn't seem likely, because none of the URLs you listed, except the somewhat lame-looking XSS one, was any type of attack at all.
On the other hand, this article ( [en.wikipedia.org...] ) describes another use of IP forgery, in denial-of-service attacks. If (a big IF) this is the case in your situation, the attack would be against Yahoo (not you) - by flooding Yahoo with responses to requests that they hadn't sent.
Blocking the requests might trivially reduce the amount of data being sent back to Yahoo (if your 403 response is shorter than your 404 response), but it won't reduce the number of responses. Yahoo will simply receive your 403s instead of your 404s.
IF this is the case, then contacting Yahoo at least once was probably a good idea, because otherwise they could think your site was attacking them. Also, if you provided them with the complete requests from your log, they might have been able to check those against the requests that they actually sent out. If they didn't match, it might, somehow, help them with respect to a DDoS attack underway or planned.
I disallowed slurp * yesterday at 7am EST, by 3pm EST it had stopped crawling.
That would seem to suggest that the sender was receiving your responses and adjusted their behavior accordingly. If so, it wasn't IP address spoofing. I guess you can't really be sure they stopped crawling *because* of your ban and not just coincidentally at the same time as the result of your previous contacts with them.
Or maybe your site was being used in a DDoS attack that was underway, and Yahoo found the source and got the originating server shut down, so it's no longer sending requests to your site. Once you start trying to analyze scenarios, there can be an overwhelming number of possibilities.
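For what it's worth, a "disallow slurp *" rule like the one quoted above can be checked locally before relying on it. This is only a minimal sketch using Python's standard urllib.robotparser; the robots.txt text, the is_allowed helper and the example paths are my own assumptions for illustration, not your actual file. It only shows what a well-behaved crawler is asked to do. It does nothing against a sender that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans only Yahoo's crawler (user-agent token "Slurp").
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /
"""

def is_allowed(robots_txt, agent, path):
    """Return True if robots_txt permits the given user-agent token to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    # Example paths are placeholders, not taken from anyone's log.
    print(is_allowed(ROBOTS_TXT, "Slurp", "/wp-login.php"))
    print(is_allowed(ROBOTS_TXT, "Googlebot", "/index.html"))
```

Note that the check uses the short user-agent token ("Slurp"), not the full User-Agent header string a crawler sends.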
Login pages can't be classed as sensitive pages. If you live in a home, everybody in the world knows your home has a door. If you use Wordpress, it has an administrator login page with a standardized name. Same for Joomla or any other web application.
It really doesn't matter if people or crawlers go snooping through your site looking for login pages. If you don't have the login pages they're looking for, there's no harm done. If you do have the login pages they're looking for, your "door" is protected by a super-strong password (right?), and there's still no harm done and no harm possible.
It still doesn't make it right for Yahoo (or a rogue 3rd party) to go tapping the walls to find out where the safe is.
As long as you keep the safe locked, it really doesn't matter how much wall tapping goes on, except for the probably small amount of bandwidth used by the 404 responses.
In the case of Yahoo (and any search engine crawler), it's their job to tap the walls and find out what pages are or are not in a site.
If you DO have pages (or any files or directories at all) that you DON'T want crawlers to run across, password protect them. Don't list them in robots.txt. Just password protect them.
Is it now Yahoo policy to deliberately and pro-actively sniff for admin and login pages?
Yahoo or Google or some other crawler might do that for some reason, such as to create estimates of internet web application usage. They also crawl looking for embedded malware, and probably also for other reasons. They're crawlers, and they make statistical models of the internet, and who knows what else.
Has someone worked out how to manipulate serps in order to find admin and login pages?
No. Those can be found by ordinary web searches.
If those IPs were Yahoo, they got 404 responses indicating that those pages don't exist on your site, which is correct.
If those IPs were not Yahoo, they got no response at all, and still have no idea whether you have those pages or not. And even if they knew you had them, it wouldn't make any practical difference.
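One common way to decide which of those two cases applies is a forward-confirmed reverse DNS check on the logged IP: look up the hostname for the IP, check that it falls under the crawler's domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below is only an illustration under stated assumptions: the is_verified_crawler helper is hypothetical, the ".crawl.yahoo.net" suffix is my assumption about Slurp's hostnames and should be checked against Yahoo's own documentation, and the resolver arguments exist only so the logic can be exercised without network access.

```python
import socket

def is_verified_crawler(ip, suffixes=(".crawl.yahoo.net",), reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a logged IP address.

    suffixes: hostname endings accepted as the crawler's (assumed, not authoritative).
    reverse/forward: optional resolver callables; default to real DNS lookups.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)  # IP -> hostname (PTR lookup)
    except OSError:
        return False
    if not host.lower().rstrip(".").endswith(tuple(suffixes)):
        return False  # hostname is not under the expected crawler domain
    try:
        return ip in forward(host)  # hostname -> IPs must include the original IP
    except OSError:
        return False

if __name__ == "__main__":
    # 203.0.113.7 is a documentation-range placeholder, not an address from any log.
    print(is_verified_crawler("203.0.113.7"))
```

The forward lookup matters because whoever controls an IP block also controls its reverse DNS record and could make it claim any hostname.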
A scenario not yet mentioned is the possibility that someone hacked the Yahoo crawler server(s) and reprogrammed them to make these requests. My response to that is No. Anyone with the sophistication to do that would have used the servers for a far more sophisticated purpose than these lame requests.