Naughty Yahoo User Agents


jdMorgan - 6:08 am on Jun 13, 2006 (gmt 0)


If *all* the webmasters who read here at WebmasterWorld banned *all* Yahoo user-agents...
Yahoo probably wouldn't notice.

That 'attention-getting' tactic just isn't likely to work. I'm willing to accept that Yahoo! and all the other major search engines make a good-faith effort to comply with robots.txt, but that coding errors, bugs, database disconnects, and misunderstandings of the 'protocol' do happen.

The only reason I ban any major 'bot from any page or cloak any page is to keep that page out of the index. And the only reasons I do that are:

  • Dangerous content: The PRC has severe penalties against things that are perfectly legal to do or discuss in the U.S. I don't want some poor schmuck going to prison for stumbling onto my site.
  • Controlling entry click-paths and context: Some pages make little sense if you miss the previous page.
  • Limited-distribution information: If you're on the site, and reading at depth, here's the info and the graphics. If you're scraping for contact info or membernames or multimedia using search, they're not there.
  • Bot traps: Trying to keep the majors from banning themselves, especially when they fire up a new IP address block.
  • Others that slip my mind right now.

    Bottom line is that I'm a realist and a pragmatist; this is business. So I don't ban anybody out of malice or spite. I just decide if I need their traffic or not, and if not, they get a 403. If Yahoo! were to publish a statement that they intended to disregard robots.txt in the future, I still wouldn't ban them. But they'd be seeing a heckuva lot more in the Vary: User-agent class... ;)
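For anyone who wants to see the "403 or Vary: User-agent" decision spelled out, here is a minimal sketch as a Python WSGI app. It is not the setup described in this post (that isn't shown here); the tuple `BLOCKED_AGENT_TOKENS` and the name `examplebot` are hypothetical, and plain substring matching on the User-Agent header is just one simple way to make the call.

```python
# Hypothetical list of lowercase User-Agent substrings we have decided we don't need.
BLOCKED_AGENT_TOKENS = ("examplebot",)


def app(environ, start_response):
    """Tiny WSGI app: 403 for unwanted agents, normal content for everyone else."""
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()

    # The response depends on the User-Agent header, so tell caches about it.
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Vary", "User-Agent"),
    ]

    if any(token in user_agent for token in BLOCKED_AGENT_TOKENS):
        start_response("403 Forbidden", headers)
        return [b"Forbidden\n"]

    start_response("200 OK", headers)
    return [b"Welcome\n"]
```

The Vary header is sent on both branches because either response was chosen by looking at the User-Agent, and an intermediate cache should not hand one agent's answer to another.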

    Above, I posted the exact structure of the robots.txt that Slurp China is choking on, with the URLs obscured to comply with the WebmasterWorld TOS and my own desire for privacy. But other than those changes, the example is a letter-perfect rendition of my actual code. I think Yahoo! can easily test it themselves, if they're so inclined.

    Also, the problem is most likely in parsing User-agent names. Anybody could do a 'less risky' test by disallowing just a single URL-path to Slurp China if they wanted to. I suspect they'd see the same failure I did.
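The 'less risky' single-path test can be sketched offline before trying it on a live site. The example below uses Python's standard `urllib.robotparser` to show how one parser reads such a file; the robots.txt content and the path `/test-path/` are hypothetical placeholders, not the file from this thread, and the result says nothing about how Slurp China itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical test file: one path disallowed for Slurp China only.
ROBOTS_TXT = """\
User-agent: Slurp China
Disallow: /test-path/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, agent_token, path):
    """Return True if this parser would let agent_token fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    for agent in ("Slurp China", "Slurp"):
        verdict = is_allowed(ROBOTS_TXT, agent, "/test-path/page.html")
        print(f"{agent!r} allowed: {verdict}")
```

Pass the robot's name token rather than the full User-Agent header, since this parser only looks at the part of the string before the first "/". Note also that it matches agent names case-insensitively by substring, so a record for "Slurp" alone would apply to "Slurp China" too; other parsers may resolve multi-word names differently, which is exactly the kind of ambiguity a live single-path test would expose.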

    Jim


    Thread source: http://www.webmasterworld.com/search_engine_spiders/3276.htm
    Brought to you by WebmasterWorld: http://www.webmasterworld.com