
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
worio+bot+heritrix/1.10.0
blend27




msg:3354452
 11:25 am on May 31, 2007 (gmt 0)

Mozilla/5.0+(compatible;+worio+bot+heritrix/1.10.0++http://worio.com)

Reads Robots.txt and ignores it.

It fetched 2 pages, got itself into the trap, and kept going, drawing a 403 about 50 times.

I just don't get it. Why read Robots.txt if later not obeying it?

 

fiestagirl




msg:3360206
 3:45 pm on Jun 6, 2007 (gmt 0)

Why? To see where you keep the "good" stuff..

bouncybunny




msg:3362109
 1:45 pm on Jun 8, 2007 (gmt 0)

I'm getting

Mozilla/5.0 (compatible; worio bot heritrix/1.10.0 +http://worio.com)

from this IP 207.23.252.*
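
For anyone who wants to flag these requests in their own logs, here is a minimal sketch that checks both the user-agent string and the source IP. It assumes the "207.23.252.*" wildcard above means the 207.23.252.0/24 network; the function name `looks_like_worio` is made up for illustration.

```python
import ipaddress

# Assumption: "207.23.252.*" from the post is read as the /24 network below.
WORIO_NET = ipaddress.ip_network("207.23.252.0/24")


def looks_like_worio(user_agent: str, remote_ip: str) -> bool:
    """Return True when the UA names worio/heritrix AND the IP is in the reported range."""
    ua = user_agent.lower()
    # Substring checks work for both log styles seen in this thread
    # (spaces, or "+" in place of spaces).
    ua_matches = "worio" in ua and "heritrix" in ua
    ip_matches = ipaddress.ip_address(remote_ip) in WORIO_NET
    return ua_matches and ip_matches
```

Requiring both conditions is a design choice: a user-agent string alone is trivially spoofed, and an IP range alone may be shared with unrelated traffic.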

Lord Majestic




msg:3362165
 2:27 pm on Jun 8, 2007 (gmt 0)

I just don't get it. Why read Robots.txt if later not obeying it?

Weird things do happen: it is possible the bot never received that robots.txt in full, even though your log shows that it was requested. Maybe (even likely) there was a bug in its robots.txt parsing. Maybe you made a mistake yourself by assuming that if Google obeys your robots.txt then the file is actually standards-compliant: experienced bot writers typically add detection for a number of common webmaster errors, inferring the meaning of robots.txt directives rather than following the standard to the letter.

The WWW is not a perfect system where everything functions with 100% reliability. If a bot writer made an attempt to read robots.txt, then at least they are trying. Perhaps they have a bug on their end, perhaps it is you (no one is infallible in this world). According to the link in the user-agent, this bot was made by the "University of British Columbia's Laboratory for Computational Intelligence", hardly an evil organisation. Remember that Google started in a university too, so give the good chaps a break: contact them with your problem rather than just posting somewhere they probably don't even look. The problem might actually lie in the "heritrix" codebase used by the Internet Archive, which they open sourced.

I have no relation to the bot, just felt it was worthwhile answering the question based on my fairly extensive experience in crawling the Weird World Web :)

[edited by: Lord_Majestic at 2:28 pm (utc) on June 8, 2007]

blend27




msg:3364543
 5:53 pm on Jun 11, 2007 (gmt 0)

--never received that robots.txt in full

maybe

---it was a bug in robots.txt parsing

maybe

-- perhaps they got a bug on their end

maybe

-- contact them with your problem

i don't have one

---give good chaps a break---

I do, but not on a site that clearly states

User-agent : *
Disallow: /trap

When something gets into the /trap I whack the good and bad guys on the same level, whether it's Google, Yahoo (in fact within the last 5 minutes), WORIO or any other yo-yo.

The WORIO spider showed typical scraper behaviour, based on my script that tracks them, and got blocked.

BTW, I've seen it do so on more than one site; just simply stating the fact.

Don't get me wrong, I love all of them.

:)
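
The trap approach described in this post can be sketched roughly as follows. This is a hypothetical illustration, not the poster's actual tracking script: the class name `TrapBlocker`, the in-memory set, and blocking by IP are all assumptions.

```python
from typing import Set


class TrapBlocker:
    """Hypothetical robots.txt honeypot: any client that requests the
    disallowed path is blocked from then on, whoever it claims to be."""

    def __init__(self, trap_prefix: str = "/trap") -> None:
        self.trap_prefix = trap_prefix
        self.blocked: Set[str] = set()  # assumption: block by IP, kept in memory

    def status_for(self, ip: str, path: str) -> int:
        """Return the HTTP status this request should receive."""
        # Clients already caught get 403 for every later request.
        if ip in self.blocked:
            return 403
        # First request into the trap: remember the client and refuse it.
        if path.startswith(self.trap_prefix):
            self.blocked.add(ip)
            return 403
        return 200
```

A well-behaved crawler never requests anything under `/trap` because robots.txt disallows it, so under this scheme only clients that ignore or misread the file end up blocked.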

volatilegx




msg:3365968
 2:25 am on Jun 13, 2007 (gmt 0)

---give good chaps a break---

I do, but not on a site that clearly states

User-agent : *
Disallow: /trap

Bad syntax, there. I hope that's not a copy from your robots.txt. You should remove the space between "User-agent" and ":".
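
To show why that space could matter, here is a toy robots.txt reader with a strict and a lenient mode. It is only a sketch under stated assumptions: the function `disallowed_for_all` is invented for this example, it handles only `User-agent` and `Disallow` with simplified grouping, and it says nothing about how heritrix or any real crawler parses the file.

```python
from typing import List


def disallowed_for_all(robots_txt: str, strict: bool) -> List[str]:
    """Collect Disallow paths from groups whose User-agent is '*'.

    strict=True rejects any line with whitespace before the colon;
    strict=False trims the field name first, as a forgiving parser might.
    Simplification: each User-agent line starts a new group.
    """
    paths: List[str] = []
    in_star_group = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and outer whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if strict and field != field.rstrip():
            # e.g. "User-agent : *" is treated as invalid and skipped
            continue
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            in_star_group = value == "*"
        elif field == "disallow" and in_star_group and value:
            paths.append(value)
    return paths


rules = "User-agent : *\nDisallow: /trap\n"
print(disallowed_for_all(rules, strict=True))   # []
print(disallowed_for_all(rules, strict=False))  # ['/trap']
```

In strict mode the malformed `User-agent : *` line is dropped, so the `Disallow` that follows belongs to no group and the trap path is never excluded. That would look, from the server side, like a bot that reads robots.txt and then ignores it.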

