Forum Moderators: goodroi
The info page says it's a distributed crawler, so just like my policy for the chronic robots.txt violator Grub, I banned the user agent and the entire IP block associated with the offending IP.
I asked for robots.txt in the 2nd post in this thread; I fail to see any sarcasm in it
The sarcasm and insults came in later posts.
You told me off
Told you off? In post #3? I declined to post the robots.txt and instead offered to answer any questions you might have about it. I also clarified webmasterworld's "new user" status of my username in case that would help in any way.
clearly my bot does not just ignore robots.txt
I never said that it "just" ignored robots.txt. I said that on one of my sites it violated robots.txt.
and since I know for a fact that it works in principle, I can't just simulate a case that I have no idea about.
I'm not disputing that. The dispute here seems to be about what information was offered.
don't expect anything to happen with your "report" as it lacks credibility
I don't expect you to do anything with my report since you declined any further information that I offered.
due to lack of data minimally necessary to verify it.
That's debatable. You rejected my offer to supply more information, so there's no way to know if it would have been enough to verify my claim.
Anyway, I think I am going to publish code that does the job (checks if URL should not be retrieved for a given robots.txt) so that anybody who questions MJ12bot's support for robots.txt can see for themselves.
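As a rough illustration of the kind of check being described, here is a sketch using Python's standard-library urllib.robotparser (the robots.txt content and URLs below are made-up examples, not taken from any site in this thread):

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration only
robots_txt = """\
User-agent: *
Disallow: /bottrap/
Disallow: /mailyou/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler must skip disallowed paths...
print(parser.can_fetch("MJ12bot", "http://example.com/bottrap/x.html"))  # False
# ...but may fetch anything not disallowed
print(parser.can_fetch("MJ12bot", "http://example.com/index.html"))      # True
```

Given a site's actual robots.txt and the URLs from the access log, the same two calls settle whether a fetch was allowed.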
Good luck. Earlier in this thread, I even said it's possible that you've corrected whatever bug caused the violation in my logs. Unfortunately, we'll never know where the problem was because you rejected my offer of help. And you said you already corrected a bug that caused a violation of someone else's robots.txt. Maybe that was actually the same bug that affected my site. At this point it's just too late for your bot on sites that I administer.
At this point it's just too late for your bot on sites that I administer.
But it's not too late for other people's sites. And if the purpose of your report is to help others, then the first port of call is the person who created the bot, i.e., me. But since you refused to provide the minimum information necessary to trace the problem (and you clearly never contacted me via the email shown on the page in the bot's user agent, as all such emails were resolved to the mutual satisfaction of all sides), I have no choice but to rest my case: anyone who cares about a bug being fixed would provide the information necessary to do so, certainly if that information is publicly available.
My robots.txt function requires two pieces of data: the robots.txt and the URLs to check; if these are not present, then I can't check it, end of story.
but since you refused to provide the minimum information necessary to trace the problem
That point has already been raised and countered multiple times in this thread.
and you clearly never contacted me by email
Just another variation of a point that has already been raised and countered multiple times in this thread.
anyone who cares about bug being fixed would...
That point has already been raised and countered multiple times in this thread. You rejected my offer to provide you with more information.
My robots.txt function requires...
That point has already been raised and countered multiple times in this thread.
end of story.
That would be nice as it seems the same things keep getting repeated over and over.
User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/
See those last two entries? Spider traps. Your crawler hit both of them. I believe an apology is due to at least one other member in this thread.
Additionally, I suggest taking this crawler off-line until it actually does comply -- otherwise you're going to find it in many mod_rewrite rulesets.
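For reference, a mod_rewrite ruleset of the kind mentioned above might look like this (the user-agent substring "BadBot" is a placeholder, not a real bot name):

```apache
# Return 403 Forbidden to any request whose User-Agent
# contains the placeholder string "BadBot" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]
```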
To save bandwidth, the Boitho robot caches the robots.txt file locally. So it is probably not showing in your log near the other entries because we checked it some days ago.
Also, one crawler may fetch the robots.txt file, and another the HTML page.
Can you post or PM me the URL so I can investigate?
I'd suggest you also take note of the "Expires" header of robots.txt if it is provided by the server, and mark that domain's cached robots.txt entry to expire if you would normally cache it beyond the expiry time.
My robots.txt is set to expire four hours after it is fetched. If boitho caches it for 24 hours, there is a chance that boitho will attempt to fetch a disallowed resource, and that would lead to a 403-Forbidden from my server.
During updates, I upload the new robots.txt in the morning, and change the files and access restriction checking in the late afternoon. Therefore, a four-hour expiry should prevent any robot that comprehends the expiry time from using a stale robots.txt. Due to production schedules, I cannot wait 24 hours after I receive notice of changes, so four hours is a practical limit.
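A sketch of that caching rule in Python (the function name and the 24-hour crawler-side default are assumptions for illustration, not anything boitho has published): the cached copy expires at the server's Expires time or the crawler's own TTL, whichever comes first.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_TTL = timedelta(hours=24)  # assumed crawler-side default

def robots_cache_expiry(fetched_at, expires_header=None):
    """Return when a cached robots.txt should be refetched:
    the server's Expires time or the default TTL, whichever is sooner."""
    default_expiry = fetched_at + DEFAULT_TTL
    if expires_header is None:
        return default_expiry
    try:
        server_expiry = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return default_expiry  # unparseable header: fall back to the TTL
    return min(server_expiry, default_expiry)

fetched = datetime(2005, 6, 14, 12, 0, tzinfo=timezone.utc)
# Server marks the file stale four hours after the fetch:
print(robots_cache_expiry(fetched, "Tue, 14 Jun 2005 16:00:00 GMT"))
```

With a four-hour Expires header, such a crawler would refetch robots.txt before the afternoon restriction changes take effect, avoiding the stale-cache 403s described above.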
Jim
Not to change the subject (from boitho) but I at least found some sort of search engine there.
Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!
What's the point of all that crawling if there is no search engine page? -Larry
User-agent: *
Disallow: /ftr/
Disallow: /support/
Disallow: /thespot/
Disallow: /mailyou/
See those last two entries? Spider traps. Your crawler hit both of them. "
- - - -
I love it! Thanks for the trick.
NOW I understand some strange entries in my logs. I had the following in my robots.txt:
User-agent: *
Disallow: /airheads/ (I have a UFO site. Lots of airheads in UFO-land.)
I found a few user agents trying to access the non-existent subdirectory /airheads,
and didn't know what to make of it! Now I do. Thanks again, and back to Boitho. -Larry
It's a little hard to follow this thread with all the flaming going on. I don't think Boitho is malicious (my opinion, based on the posts from someone who works on the bot), but if it's established that it disobeys robots.txt, I'll ban it.
What about the other one? The one the whole flame war was all about... did Jazzguy end up proving the violation of robots.txt or was it all just a waste of content-building time reading it all?
thank you for your reply.
I searched through my server's logs, but could not find any reference to "boitho" trying to look at my /robots.txt for a period of more than four months before 2005-05-06:
grep -i boitho access_log.2005* | grep robot
--> no results
So the content of my /robots.txt does not seem to matter much.
However, here it is. It is very short, as it contains basically just the trap disallow (trap directory renamed) and one other statement:
----------------------------
User-agent: *
Disallow: /bottrap/
User-agent: ia_archiver
Disallow: /
----------------------------
This file has been in place unchanged since 2002-11-02. Other bots like googlebot, msnbot, inktomi yahoo slurp, etc., as well as the ia_archiver, don't have problems with it.
If your bot uses another ident than "boitho" when fetching the /robots.txt, please let me know, and I will grep through my logs again.
Regards,
R.
Where is the MJ12 search engine? I was all over their MJ12 site, found distributed crawling,
complex blogs, all sorts of stuff .. everything EXCEPT a stinkin' search engine!
I can't post the URL here, but if you check our site now you will see a link to the Alpha version of the search engine: it was (alpha) released on 3/06/06.
It is still work in progress and the plan is to release weekly (or so) updates. I expect to see all crawled data indexed and available for searching by the end of summer.
I think that is very inappropriate behavior. I count at least Lord Majestic 6 - Jazzguy 0.
<snip></snip>
[edited by: ThomasB at 8:43 pm (utc) on June 14, 2005]
[edit reason] removed noise [/edit]
...the bot owner once again confirmed that his bot has violated robots.txt before, so there you go.
So far 2 bug reports were correct, and about 8 more were not.
Thus there is about an 80% chance that a bug report is not correct. That does not mean that all bug reports should be ignored; however, if a bug report lacks sufficient information to replicate the alleged bug, then there is nothing I can do.
Rejecting offers to supply more information is another easy way to dismiss reported bugs.
The information that matters is the robots.txt file and URLs to check that you refused to provide.
It's like saying that a browser is not rendering a page correctly and then refusing to provide the HTML to verify the bug report.
So you finally supplied him with the only info that would matter
<Sigh> As has been covered already in this thread, there's no way to know if the information I offered would have mattered because the offer itself was rejected -- no information was ever examined.
the guy never supplied the robots.txt
Already covered multiple times in this thread.
instead he offered him his advice as a consultant :P
Your sarcastic remark highlights a big part of what caused this thread to degenerate so quickly, i.e., assuming that a report is inaccurate or that a reporter is a rookie.
It's a rather safe guess
More assumptions.
that it would have taken you less time to re-type the whole of your robots.txt manually using small finger on the left hand than to post as much as you did in this thread.
Okay, now you're just trolling. Supplying the robots.txt has already been covered multiple times in this thread. Do you really want to spend another three pages saying the same things over and over again?
A typical rookie mistake is not providing the necessary data to fix a problem
I always ask for the specific information that helps track down a bug, and asking for the robots.txt is pretty much a given, so much so that I never came across someone actually refusing to show it -- jazzman is the first, and hopefully the last.
It's as if the guy does not want his problem fixed, or the problem did not exist in the first place. Note that he started this thread accusing the boitho spider of violating robots.txt, but then quickly switched to bashing my bot, in both cases not providing any details that could reasonably help investigate the problem.