Yahoo! directory blocking my bots?

         

Bjorn Iceland

2:50 pm on Oct 14, 2004 (gmt 0)

10+ Year Member



I run a link popularity service that has been checking the Yahoo! directory for the past year without problems in order to watch certain Yahoo! directories.

A couple of weeks ago Yahoo! began to block the server on which my script is running, for a few days at a time. I tried moving the script onto another server, but clearly the blocking must have to do with the number of requests I am making.

I would like to adhere to Yahoo!'s policy on this (for example I use the Dmoz RDF for searching there so as not to hit their site).
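
One common courtesy when a script makes many requests to a single site is to space them out. What follows is only a minimal Python sketch of that idea, not Yahoo!'s stated policy: the class name `Throttle` and the `min_interval` parameter are illustrative assumptions, and the thread does not say what request rate Yahoo! tolerates.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to one host.

    Hypothetical helper for illustration; the clock and sleep functions
    are injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last = now
        return waited
```

A crawler loop would call `throttle.wait()` immediately before each request, for example with `Throttle(2.0)` for a two-second gap. The right interval is a guess unless the site publishes one.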

Has anyone experience of this?

Best,
- Bjorn

Brett_Tabke

12:29 pm on Oct 15, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Ya, you are in violation of the Yahoo TOS. No bots are allowed to scrape Yahoo in that manner. In order to gain access again, you will either have to get a new ISP or beg Yahoo and grovel.

Bjorn Iceland

9:59 pm on Oct 17, 2004 (gmt 0)

10+ Year Member



I do find your answer interesting. The blocks are temporary lasting for a few days at a maximum for the server concerned, but do of course disrupt my service.

I feel the taint of hypocrisy from Yahoo! here if what you express is indeed Yahoo's position. This seems like "ex post facto" action to me. It would appear to be a form of bad faith to permit something for a year and then stop it without explanation.

Please try these URLs:
[yahoo.com...]
[dir.yahoo.com...]
(http://www.searchengineworld.com/cgi-bin/robotcheck.cgi is very useful.)

Do you find the same thing that my robot finds? If Yahoo does not follow the robots.txt convention on its own site, and permitted me to access it for a year without problems, why should I be required to pay any penance to Yahoo?
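
For readers who want to run this kind of check themselves, here is a small Python sketch using the standard library's `urllib.robotparser`. The function name `is_allowed`, the `ExampleBot` user agent and the `example.com` URLs are hypothetical. The robots.txt text is passed in as a string so the example runs offline; a real checker would first download the file from the site in question.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt (the file's text) permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; these are not Yahoo!'s.
rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "ExampleBot", "http://example.com/index.html"))
print(is_allowed(rules, "ExampleBot", "http://example.com/private/page.html"))
```

Under the convention, an empty file imposes no restrictions, so `is_allowed("", ...)` returns True. Whether a site's Terms of Service say something different is a separate question, which is what this thread is arguing about.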

In the spirit of equity let us apply the same type of thinking as you are expressing here to Yahoo! themselves.

On sites that do not have a robots.txt do Yahoo operators manually check the Terms of Service when they are crawling it themselves just in case this is not permitted?

Do they in fact even bother to respect robots.txt files? Perhaps not: [webmasterworld.com...]

(My own experiments with my own SpiderAware system support this disregard for robots.txt by Yahoo's spiders.)

Therefore, if they do not respect robots.txt files on others' sites and do not have a robots.txt on their own site, then it is hypocritical of them to demand that I do not crawl their site.

I do not care if they are the number one site on the Internet, they are to be bound by the same conventions as the rest of the Internet. Because one is bigger and stronger than another does not mean they are exempt from custom and accepted practice. In fact I would argue that they should be more bound than other sites to present a good example.

So, Yahoo uses my sites' bandwidth, and I do the same for them. I would argue that I bring at least equal value to their site in increased page views of their advertising by the service I provide to my customers that monitors their site, as their bandwidth that I consume.

But if that is not good enough value for value for Yahoo then perhaps I should instead start an association of webmasters who wish to encourage Yahoo to practice what it preaches without hypocrisy: to manually read the Terms of Service of every site that does not have a robots.txt file, and also simply to follow the accepted custom of robots.txt files.

Perhaps then it shall not be the ones who are currently the 'small guy' who shall end up doing the 'grovelling'.

WebFusion

2:34 am on Oct 18, 2004 (gmt 0)

10+ Year Member



Then again, it's their index/company. Like any other American company, they have the right to refuse service.

I've always thought it a bad idea to base a business on an entity one has no control over (I actually did that once - big mistake on my part).

ogletree

3:02 am on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Everyone does business with someone they don't have control over. What you mean is putting all your eggs in one basket. There are any number of factors that you don't have control over that can shut you down.

Bjorn Iceland

11:46 am on Oct 18, 2004 (gmt 0)

10+ Year Member



I am most certainly an advocate of freedom of association and free markets (as are the vast majority of my countrymen).

However, if there is one thing that gets the blood of Icelandic people boiling it is hypocritical behaviour, and then to put the icing on the cake, being asked to "beg" and "grovel" because of it.

In my experience the English and their cousins the North Americans have a similar attitude, although it is far more diluted these days.

Lord Majestic

12:16 pm on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not an issue of resources or of having a correct robots.txt. It's about one company (you) making money off something belonging to another company (Yahoo), with no gain to that other company from you trying to make a living. Shame really: they should have gone eBay's route and provided APIs for reasonable money, to enable third-party software that they could have controlled tightly. It would have cost less as well, since bots would not have needed to incur the overhead of HTML.

Bjorn Iceland

2:25 pm on Oct 18, 2004 (gmt 0)

10+ Year Member



Yes, I am making money because of this, and for that I make no apology.

Yahoo! are certainly making a little money because of my service. A side effect of my service is page views of their advertisements by my customers, as well as the odd new paid listing for them. Many of my customers would not otherwise look at the Yahoo directory as much.

I agree that they should provide an API (paid or not) to third parties.

By way of example, Dmoz provides an RDF dump on a weekly basis (http://rdf.dmoz.org/rdf). Google use Dmoz and so are certainly more in tune than Yahoo with what would be good for them in this area.

Being fair to 'small guys' like me is a positive sum game. Google sees that, but Yahoo seems to think it's a zero sum game.

The hypocrisy of their approach to robots.txt is of course indefensible.

By way of analogy bad faith is allowing someone to drink from your private stream for a year and then suddenly stop certain drinkers from doing so without explanation when the whole point of your business is to allow people to drink freely from that stream.

Lord Majestic

2:53 pm on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You have my sympathy, but just because the rightful owner of a private stream did not enforce his/her rights does not mean you gained any rights. Be grateful you were allowed to drink from that stream for a year, say thanks for being allowed to drink from it for so long, and move on.

WebFusion

12:32 am on Oct 19, 2004 (gmt 0)

10+ Year Member



Everyone does business with someone they don't have control over

I should have been more clear. The product I had was a piece of software that relied on another company's data to function (much like that of the starter of this topic). When that company decided to no longer allow our system to interface with theirs, it effectively killed that product.

Having said that, I see no problem with Yahoo (or any other engine for that matter) restricting how their data is used/accessed.

Here's an analogy for you:

About 2 years ago we decided to put a large in-ground pool in our back yard. Now, all the properties in my area sit on about 3-4 acres of land each, and there were (at the time) no fences of any kind between properties.

Shortly after installing the pool, I began to catch the neighborhood kids swimming and playing in it without our permission. This continued until we installed a privacy fence to keep them out (along with a nice big dog ;-)

Now...am I obligated to give those kids access to my pool simply because they were using it before without my permission? I think not.

Nor is it "hypocritical" for me to forbid them to use it.

Yahoo's data is its own to do with as it wishes. There's nothing stopping you from building a competing product, but you can't expect a for-profit company to simply allow mining of their data by automated systems without compensation.

Bjorn Iceland

10:59 pm on Oct 20, 2004 (gmt 0)

10+ Year Member



Yahoo's data is its own to do with as it wishes. There's nothing stopping you from building a competing product, but you can't expect a for-profit company to simply allow mining of their data by automated systems without compensation.

As I say, their business is based on allowing people to drink from their private stream.

This blocking is not going to stop me; of course I will find a way to get access. It is just very annoying that they do things like this instead of providing an API or a dump like Dmoz do.

I want Yahoo to make money, and would have no problem paying a reasonable fee for this access.

It's as stupid as the canutes from the RIAA, etc trying to stop music file sharing. Instead understand the realities of the technology and how people will use it, and modify the business model to accommodate that reality.

Lord Majestic

11:17 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This blocking is not going to stop me; of course I will find a way to get access. It is just very annoying that they do things like this instead of providing an API or a dump like Dmoz do.

I'd be very careful - they have deep pockets and lawyers on the payroll who need to justify their existence. You'd be amazed how a single letter from a big company like that, threatening to take you to court, would change your views.

Bjorn Iceland

1:28 pm on Oct 21, 2004 (gmt 0)

10+ Year Member



I quake in my boots.

Yahoo are welcome to come to Iceland and tell it to the judge. They will find that justice lives on here.

Our legal system has not been corrupted by the influence of lawyers as it has in the United States.

I stand by all I have said.

WebFusion

2:45 pm on Oct 21, 2004 (gmt 0)

10+ Year Member



I quake in my boots.
Yahoo are welcome to come to Iceland and tell it to the judge. They will find that justice lives on here.

Our legal system has not been corrupted by the influence of lawyers as it has in the United States.

I stand by all I have said.

It always amazes me how those outside the US are so quick to insult us, yet are more than happy to pirate the goods/services of US companies to make a buck.

As far as "telling it to the judge", I doubt that would even be necessary. Should Yahoo truly want to block you from using their data, all they would have to do is keep blocking your I.P.

Having said that, I'm glad that those of us living in such a corrupt country can build companies that can provide you with a living.

Lord Majestic

2:48 pm on Oct 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I doubt that would even be necessary.

I agree - they will just permanently ban his IP, and there will be no way he can maintain quality of service for his customers. As soon as customers know the IP is banned, they will know that the service is illegitimate, at least in the eyes of Yahoo, and stories of people getting their own sites banned for using software like that would do the trick.

Bjorn Iceland

1:13 am on Oct 23, 2004 (gmt 0)

10+ Year Member



It always amazes me how those outside the US are so quick to insult us, yet are more than happy to pirate the goods/services of US companies to make a buck.

No, I have a great deal of respect for the people of the United States and the hundreds of millions of hardworking individuals there. I do not have respect for the rampant abuse of the US legal system in a way that only profits lawyers.

As far as "telling it to the judge", I doubt that would even be necessary. Should Yahoo truly want to block you from using their data, all they would have to do is keep blocking your I.P.

I am making the point that here we do not give in to intimidation tactics from large companies.

Having said that, I'm glad that those of us living in such a corrupt country can build companies that can provide you with a living.

Icelanders as a whole may indeed have less corruption in their country than the United States (actually that is a fact according to dubious authorities like Transparency International) but you misunderstand.

I am simply offended by the idea that I should fear someone (be they a person or a company) just because they are bigger than me. That is what was implied about justice today in the US by Lord Majestic.

That actually is far more offensive to me than the idea that Yahoo can under their freedom of association rights block access to their site.

My objection with Yahoo is simply that they could be making a fee from me and thousands of others directly (as opposed to the indirect page views of their advertisements and the listings from my customers that they receive now). Instead they waste their time blocking people like me, and their shareholders' money employing high-powered lawyers, when engineers and marketing people would be a better investment.

Like the P2P file sharers of music files I will simply find a more indirect route which Yahoo can't block.

When Yahoo start looking at other sites' Terms of Service before crawling them please let me know.

[edited by: Bjorn_Iceland at 1:22 am (utc) on Oct. 23, 2004]

Bjorn Iceland

1:20 am on Oct 23, 2004 (gmt 0)

10+ Year Member



P.S. Are you reading this Yahoo_Mike?

If so, would you please comment on the API issue? I am sure this issue must have come up before.

Lord Majestic

1:24 am on Oct 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My objection with Yahoo is simply that they could be making a fee from me and thousands of others directly

You should consider one simple thing: the kind of stuff you do is likely NOT to be in the best interests of Yahoo. Your activity is likely to be viewed by Yahoo as something that helps people beat their anti-spam algorithms. Right or wrong, they can choose to err on the side of caution and discourage any activity apart from the human searching that makes them money.

Your parallel with P2P is incorrect - like Napster, you rely on a single source that you can easily be locked out from: Yahoo's servers.

WebFusion

3:33 am on Oct 23, 2004 (gmt 0)

10+ Year Member



When Yahoo start looking at other sites' Terms of Service before crawling them please let me know.

How about the robots.txt standard? Last time I checked, Yahoo's crawlers fully conform to that standard. If you want them to keep their hands off your data, you are free to use it to exclude them. Why should they not have the same right to refuse access to their data?

Icelanders as a whole may indeed have less corruption in their country than the United States

Apples and oranges, bud. That's kind of like saying a room with a dozen 5 year olds has less disciplinary problems than a room with 50. Your little country has a population of less than 300,000 people compared to our 300 million. With a larger population comes larger problems. However, nice way to slip in ANOTHER insult to my country. Keep em coming ;-)

With your reference to P2P music downloaders, I've finally realized your mindset. However, I personally believe that intellectual property is just that -property. Using someone else's property without permission or compensation is stealing, plain and simple. That (IMHO) includes data, bandwidth, software, and so on.

Having said that, I do find your sense of entitlement curious, to say the least.

Bjorn Iceland

12:50 pm on Oct 23, 2004 (gmt 0)

10+ Year Member



Apples and oranges, bud. That's kind of like saying a room with a dozen 5 year olds has less disciplinary problems than a room with 50. Your little country has a population of less than 300,000 people compared to our 300 million. With a larger population comes larger problems. However, nice way to slip in ANOTHER insult to my country. Keep em coming ;-)

I do not intend insult. It is you who are trying to deflect this conversation off topic and toward nationalism, which is not my intent.

Our cultures are based on the same principles of life, liberty and property. Vikings influenced England in the invasion of 1066 and from the same culture the people of the US took their independence from 1776 to protect their rights as Englishmen.

There are many countries as small as us who are shown as corrupt on the Transparency International scale. We are different because of our culture which we have in common with England and the US. Size is not the issue.

The reason our legal system is not corrupted by the lobbying of lawyers trying to enrich themselves, and of people wanting money for nothing ("the compensation culture"), is that we watch our politicians and other people in positions of power very closely and do not tolerate inequity and abuses of power, no matter how small they might seem.

Justice means not giving in to special interests, and ensuring that the weak and the powerful are equal under the law. That is how it is here.

Regarding the US legal system, just read the implied threats of me needing to fear Yahoo "because they are big" and then read Overlawyered.com on a regular basis. I am sorry, but the people of the US have the legal system they have today because they have not kept enough watch against abuses of power.

Bjorn Iceland

12:53 pm on Oct 23, 2004 (gmt 0)

10+ Year Member



How about the robots.txt standard? Last time I checked, Yahoo's crawlers fully conform to that standard. If you want them to keep their hands off your data, you are free to use it to exclude them. Why should they not have the same right to refuse access to their data?

And has Yahoo! suddenly started to use robots.txt on their sites (see page 1 of this thread)?

WebFusion

1:22 pm on Oct 23, 2004 (gmt 0)

10+ Year Member



And has Yahoo! suddenly started to use robots.txt on their sites

Would that matter? If they specifically did so to "ask" your site not to access theirs, by your own admission you would try to "find a way they can't block".

At any rate, I'm done with this discussion. I've found it's a waste of time trying to dissuade people like you, as your feeling of "entitlement" to use the work of other people without compensation (much like the p2p file sharers you mentioned) leaves you with a significant blind spot as to the viability of your business model.

'Nuff said. Good luck with your business - you'll need it.

Bjorn Iceland

4:35 pm on Oct 23, 2004 (gmt 0)

10+ Year Member



WebFusion: With your reference to P2P music downloaders, I've finally realized your mindset. However, I personally believe that intellectual property is just that -property. Using someone else's property without permission or compensation is stealing, plain and simple. That (IMHO) includes data, bandwidth, software, and so on.

Having said that, I do find your sense of entitlement curious, to say the least.

Yes, I agree with the idea that bandwidth is simple property. Intellectual property (software, data) is not as clear-cut and is not the same kind of property. But that is a debate for another time.

But agreeing as we do with the basics that property belongs to its owner, and taking this line of looking at things for a second then: i) Yahoo! steals my bandwidth when it crawls without my explicit permission (see page 1 of this thread about ignoring of robots.txt) and ii) I steal their bandwidth by my own crawling without their explicit permission.

Two wrongs do not make a right, certainly. But is Yahoo going to suddenly stop crawling and read everyone's Terms of Service if they find no robots.txt on the site they are crawling for their index (as I was asked to do at the start of this thread)?

If not, it is the hypocrisy that I am taking issue with. I cannot be held to be a hypocrite if I treat them as they treat me.

We should both be held to the same standards of behaviour. Just because they are large and I am small does not excuse them.

I am entitled to treat them as they shall treat me. I shall not put up a robots.txt (like them) and I shall insist they read Terms of Service (if that is what they ask me to do). That is the opposite of hypocrisy.

Lord Majestic: You should consider one simple thing: the kind of stuff you do is likely NOT to be in the best interests of Yahoo. Your activity is likely to be viewed by Yahoo as something that helps people beat their anti-spam algorithms. Right or wrong, they can choose to err on the side of caution and discourage any activity apart from the human searching that makes them money.

Now, if you are to do this in the P2P way I allude to, they will not be able to tell the difference from human searching, and there shall be no greater visible burden on their servers.

What a waste of Yahoo!'s time to write algorithms to check for this when they could provide an API and be making money from this.

Your parallel with P2P is incorrect - like Napster, you rely on a single source that you can easily be locked out from: Yahoo's servers.

No, you misunderstand. I meant that, as with modern P2P systems, the way to get around such a block is to disperse the work over thousands or tens of thousands of computers (with the permission of the owners of those computers), each doing a minute part. There shall be no single point of attack against someone accessing Yahoo in this manner, either in detection or in blocking.

So the analogy with P2P is correct.

This conversation has gone on between people on a board and we are making guesses (although of course informed guesses) about Yahoo's position on all of this except for what is in their Terms of Service.

ogletree

6:36 pm on Oct 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This has been a very interesting thread. I see what BI is saying. I also see what Y is saying. It's like the Jerry Seinfeld episode:

TMI: Do you want to switch over to TMI long distance?
Jerry: I can't talk right now. Why don't you give me your home number and I will call you later?
TMI: We're not allowed to do that.
Jerry: I guess you don't want people calling you at home.
TMI: No.
Jerry: Now you know how I feel.

Big companies like to do to us what they don't like done to them. I heard the other day that the biggest opponents of the Do Not Call list put their names on it. They want to call you but they don't want to be called. Y has to put up with unwanted spiders just like we do. I hate it as much as they do. I don't think you should have any sense of entitlement, but I also don't hold it against you that you're sticking it to the man. Don't complain, it's a game. Play to win.

Lord Majestic

7:08 pm on Oct 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Y has to put up with unwanted spiders just like we do

Don't they follow the robots.txt convention? If so, you can always block them out and that's the end of it. If, however, you choose NOT to block them out, this does not automatically mean THEY cannot choose to block you out, regardless of whether you let them crawl your site or not.
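
As a rough illustration of "blocking them out", the Python sketch below holds a robots.txt that excludes one crawler and leaves everything open to others, then checks it with the standard library's `urllib.robotparser`. The user-agent token `Slurp` is an assumption about how Yahoo's crawler identifies itself, and `OtherBot` and the `example.com` URL are hypothetical. Whether a given crawler actually obeys the file is the very point disputed in this thread; the code only shows what the file asks for.

```python
from urllib.robotparser import RobotFileParser

# Rules a site owner might publish at /robots.txt (illustrative only).
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given rules ask nothing of user_agent for this url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl("Slurp", "http://example.com/page.html"))     # excluded by the first group
print(can_crawl("OtherBot", "http://example.com/page.html"))  # falls under the * group
```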

Bjorn Iceland

7:36 am on Oct 27, 2004 (gmt 0)

10+ Year Member



ogletree, thank you for a lucid (and amusing) illustration of the serious hypocrisy I am taking issue with.

Lord Majestic, you confused me a little by your last post. As I posted in page 1 of this thread:

Please try these URLs:
[yahoo.com...]
[dir.yahoo.com...]
(http://www.searchengineworld.com/cgi-bin/robotcheck.cgi is very useful.)

...Do they in fact even bother to respect robots.txt files? Perhaps not: [webmasterworld.com...]

(My own experiments with my own SpiderAware system support this disregard for robots.txt by Yahoo's spiders.)

Yahoo! do NOT use robots.txt on their site, and it appears they also do NOT respect robots.txt on others' sites when they crawl themselves (see the link quoted above), from which I will quote here:

This is occurring for us too - a complete disregard for the robots.txt file. Yahoo! rep stated that YSlurp will obey the file, but I have weblogs that show it just doesn't care. This is happening across multiple domains.

CaboWabo

Bjorn Iceland

3:32 pm on Nov 17, 2004 (gmt 0)

10+ Year Member



I know we are the 'little guy' but I am surprised that we have heard zero from Yahoo directly on this matter, or at least from someone who 'knows someone there'.