| This 34 message thread spans 2 pages: < < 34 ( 1  ) || |
|Strange 404s from Yahoo Slurp|
For the past few months Slurp has been generating a lot of 404s. There are three types:
* Genuine 404s from pages which were deleted a while ago.
* 404s from what seems to be badly configured software
* 404s from what seems to be attempts at exploits.
The following are 404s from Yahoo sports pages such as blogs and video sections:
404 GET /nhl/blog/YYYYY/teams/Nashville+Predators/nhl.t.27
404 GET /nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:YYYYYY:nhl,photo,YYYYYYYYYYYY_nashville_pre:1
404 GET /nhl/teams/was
404 GET /nhl/teams/cob
My sector is sports, but it has nothing to do with hockey or US sports of any kind.
If I look at the referring pages, there is no link to my site, so is this badly configured software?
The following seem to be some kind of exploit:
myhigheredjobs is, I believe, a jobsite app which uses a login admin panel. Like company/contact.cfm and question/index, it is not on my site, and the requests look as if they are trawling for exploits.
The IP address does look genuine:
22.214.171.124.in-addr.arpa name = b3090812.crawl.yahoo.net.
Authoritative answers can be found from:
115.195.67.in-addr.arpa nameserver = ns2.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns3.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns4.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns5.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns1.yahoo.com.
ns1.yahoo.com internet address = 126.96.36.199
ns2.yahoo.com internet address = 188.8.131.52
ns3.yahoo.com internet address = 184.108.40.206
ns4.yahoo.com internet address = 220.127.116.11
ns5.yahoo.com internet address = 18.104.22.168
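For anyone who wants to repeat this check on other IPs: a reverse lookup alone is not proof, because whoever controls the IP block controls the PTR record. The usual test is to resolve the returned hostname forward again and confirm it maps back to the same IP. Below is a minimal Python sketch of that idea. It assumes the `.crawl.yahoo.net` suffix seen in the lookup above is the one to trust; the function name and the injectable `reverse`/`forward` parameters are hypothetical, added only so the logic can be tried without live DNS.

```python
import socket

# Assumption: the suffix is taken from the hostname in the lookup above.
TRUSTED_SUFFIXES = (".crawl.yahoo.net",)

def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs, must include IP."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        host = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not host.endswith(suffixes):
        return False  # hostname is not under a trusted crawler domain
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        addresses = forward(host)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP we started from.
    return ip in addresses
```

If this returns False for an IP claiming to be Slurp in its user agent, the request is more likely a spoof than a real crawl; if it returns True, the odd requests really are coming from that crawler network.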
So what the heck is going on here? Is this some kind of spoofing in order to crawl my site to get past current bad bot blocking and / or exploit trawling?
As I said on another thread here, Slurp is crawling the site excessively. I am wondering if some kind of spoofing is going on and whether I should block the IP entirely.
@SteveWh (who I agree with) and others in this thread. I repeat... and probably did so too short and sweet the first time around:
|Other than the bandwidth expended serving a 404, is there really a problem? Hitch up the undies and keep going. |
Some things just don't make sense, but if they don't HURT you then ignore...
Part of my log analysis is to scan the 404s, then SUBTRACT THEM from further analysis (and most anything else in the 400 series). The 500s intrigue me as they MIGHT (usually NOT) indicate other problems coming.
I prefer to think that a 404 means things are actually working correctly: PAGE NOT FOUND, served as a relatively small response from my server. Irritating, but not dangerous. I have locked down my site as firmly as possible, and those 404s simply mean I've done my job well.
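A rough Python sketch of that kind of triage, for illustration only. It assumes access log lines that start in the common layout (IP, two fields, bracketed timestamp, quoted request, status), like the log line quoted later in this thread; the function and variable names are made up here.

```python
import re
from collections import Counter

# Assumption: lines begin like  IP - - [time] "METHOD /path PROTO" STATUS ...
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def triage(lines):
    """Tally 4xx responses, set 5xx aside, keep the rest for analysis."""
    kept, server_errors, tally = [], [], Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed layout
        entry = m.groupdict()
        status = int(entry["status"])
        if 400 <= status < 500:
            tally[status] += 1           # counted, then dropped from analysis
        elif status >= 500:
            server_errors.append(entry)  # worth a closer look
        else:
            kept.append(entry)
    return kept, server_errors, tally
```

Calling `triage(open("access.log"))` (the file name is a placeholder) gives the entries left for normal analysis, the 5xx entries to inspect, and a per-status count of what was set aside.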
Meanwhile, Slurp (Yahoo) does send me visitors and I want all I can get! :)
I have, on the other hand, disallowed (via .htaccess) Pacific and Asian Slurp as my sites (in general) have little value to those robot crawls. (different language, little in common)
A good explanation from SteveWH. It would now be nice for Yahoo to confirm exactly what is going on with the wall knocking (and the other sports blog 404s).
I still think this is not being taken seriously enough. There is no way I am happy to let anything go asking for admin pages. The issue is not the security procedures in use, but the ethics of a trusted company.
I ask Caribguy or anyone else who thinks this is nothing to worry about:
Can I visit your site and start asking for admin type pages? Would you mind if I try to find out if you have a particular version of an app installed? You can serve 404s if you wish but what if I find a page which does not return a 404?
If you do mind then why would it not be right for me to do this, but perfectly fine for Yahoo to do this?
Joe owns an apartment block with rooms which he rents out to residents, and has a visitors centre and facilities which potential customers can browse and use.
Joe likes to think the apartment block is pretty much secure. If any bad visitor starts rattling the gates around the back, a guard is immediately dispatched and the bad visitor is politely sent away. If the bad visitor returns a second time, the guard issues a total ban and that visitor never returns.
Now and again realtors call by to take photos, and examine the pool in order to produce listings which Joe uses to attract new visitors.
One day Slurp Realtors rattled the back gates looking for Joe's admin office. This was allowed, as the guard had seen Slurp Realtors' ID and let them in.
But this worried Joe for the following reasons:
Slurp Realtors had never done this before.
No other apartment block owners have noticed Slurp Realtors rattling gates and knocking on walls looking for admin offices.
No other Realtors have done what Slurp Realtors were doing.
Some say that Joe should not worry about this and should get on with running his apartments. But Joe was concerned. Each day he heard the gates rattle and the walls being tapped by Slurp Realtors, and this was very unusual.
Joe has a zero-tolerance approach towards security. All visitors are free to walk around the visitor centre, all residents are allowed to walk around the apartment blocks, but only Joe and the guards are allowed to walk around the secure lot at the back.
No-one knows about the secure lot at the back - only Joe and the guards do. So if anyone does go looking for the secure lot they are instantly dealt with.
Joe contacted Slurp Realtors asking them to explain why they were looking for the secure lot and rattling gates but they never replied. Joe had no choice but to remove their Realtors Pass and since then they have not returned.
Joe has lost a tiny percentage of visitors compared to what Googlebot Realtors provides but, hey, Joe is not woken up by the sound of the gates rattling.
Yes, it really is true, I don't mind you or Yahoo looking for admin pages, nor do I mind that a web application I use makes its name and version visible on every page.
If you are concerned about the security of any web applications you use, you can check for reports of security vulnerabilities at [secunia.com...] .
You might try checking your logs for requests from this IP from before the strange requests began. Did they previously make only sensible requests, and then something changed?
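One way to run that check, sketched in Python under some assumptions: the log lines start in the common layout (IP, two fields, bracketed timestamp such as `16/Jul/2010:17:09:11 +0100`, quoted request, status), and the helper names below are invented for this example.

```python
import re
from datetime import datetime

# Assumption: lines begin like  IP - - [time] "METHOD /path PROTO" STATUS ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3}) ')

def history_for_ip(lines, ip):
    """Return (timestamp, status, path) tuples for one IP, oldest first."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(1) == ip:
            when = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
            rows.append((when, int(m.group(4)), m.group(3)))
    return sorted(rows)

def requests_before_first_404(rows):
    """Split the history at the first 404 to see what came before it."""
    for i, (_, status, _) in enumerate(rows):
        if status == 404:
            return rows[:i], rows[i]
    return rows, None  # this IP never received a 404
```

If the list of earlier requests is full of ordinary pages and the 404s start abruptly on one date, that date is the place to look for whatever changed.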
Maybe also do web searches for the name of your site in conjunction with a few snippets of the strange URLs from the requests. The goal would be to discover if anyone has placed links anywhere online that would give Yahoo the impression the URLs exist on your site. Although the inurl: searches I did on a few of them turned up only Yahoo pages in the results, you might check more of them than I did.
Does that crawler ask for normal pages in addition to the weird ones?
I said previously that scenarios are easy to invent... I don't know if Yahoo has a manual URL submission form for webmasters. If someone submitted a bunch of bogus pages supposedly on your site, Yahoo would probably crawl to see if they're really there. Even if that scenario were true, it's still no harm done: they're not there, and now Yahoo knows it. Except if you block them, they won't know it.
If there's a mystery worth any amount of concern here at all, it would be how Yahoo "got the idea" about these URLs (and the answer still could be as simple as an unusually thorough Confirm404 crawl). One of your initial concerns was about a misconfigured crawler, which could still be what the problem really is. Or a "misconfigured internet" that has bogus links to your site somewhere. That, however, seems less likely if you're not getting requests from Google for those same pages.
I allowed slurp back in again and within a few days it started asking for
etc. all again.
|...Or a "misconfigured internet" that has bogus links to your site somewhere. That, however, seems less likely if you're not getting requests from Google for those same pages. |
Exactly. Only slurp is doing this and I can find no pages with +mysite +myHigherEdJobs and the variations on any of the main SEs.
Let me go back to the yahoo.sportsbook angle and follow the line that this is a misconfigured yahoo serp or something.
22.214.171.124 - - [16/Jul/2010:17:09:11 +0100] "GET /nhl/blog/puck_daddy?author=Matt+Romig HTTP/1.0" 404 6188 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...] 0 example.com "-" "-"
Matt Romig is a senior editor of Yahoo Sports Blog, and Puck Daddy is an NHL blog edited by Greg Wyshynski.
Is this a misconfigured bot, or is this actually a case of log file spamming?
Maybe I would get more answers from Matt Romig or Greg Wyshynski as Yahoo support are no longer replying to my questions.