Msg#: 4610818 posted 12:38 am on Sep 19, 2013 (gmt 0)
Anyone know anything about meanpath dot com and/or the meanpathbot? Forums search is entirely silent :(
Its most recent crawl came from an OVH range, meaning it was a priori blocked from everything except robots.txt. It did ask for robots.txt -- but it then went on to ask for the front page of my test site, which is 100% roboted-out. As in:
"Which part of User-Agent: * Disallow: / did you not understand?"
I don't know if it would try to dig deeper, if permitted. Equally important, I can't tell if it's serving any good and worthy purpose. This is assuming for the sake of discussion that a legitimate new search engine is a "good and worthy" thing. ymmv on this point. But here I'm not sure of its legitimacy in the first place.
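For anyone who wants to sanity-check a blanket rule like the one quoted above, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name can_crawl and the example.com URLs are placeholders of mine, and this only shows how one parser reads the file, not how meanpathbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule from the post: every crawler is shut out of everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if this robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text in memory
    instead of fetching a live robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Front page of a fully disallowed site.
    print(can_crawl(ROBOTS_TXT, "meanpathbot", "http://example.com/"))  # False
```

A compliant crawler that reads this file should never go on to request the front page.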
Msg#: 4610818 posted 12:24 am on Sep 20, 2013 (gmt 0)
I am the CEO of meanpath, Inc. meanpathbot should respect your robots.txt, so if you can email me the domain of the site it tried to crawl without respecting your robots.txt, I can get the team to look into it: email@example.com or firstname.lastname@example.org
One common error we see with robots.txt is that webmasters are not aware that robots.txt is read from top to bottom, with the first applicable rule found being followed. So if you had, say, a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it, the bot would follow the allow at the top, not the disallow at the bottom.
This may not be the case here but we can easily figure out what the issue is once we do a test on your site.
So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.
Why would you not follow the most specific directive for your bot [as at least Google and Bing do] rather than simply using the first one found, since a "generic" directive is likely to be the first directive in the file?
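To make the "most specific directive" idea concrete, here is a rough sketch of group selection: the crawler uses the group that names it, and falls back to the "*" group only when no such group exists, regardless of which comes first in the file. The names parse_groups and select_group are hypothetical, the exact-name matching is a simplification, and none of this is taken from meanpath's code.

```python
def parse_groups(text):
    """Parse robots.txt text into {user_agent_token: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                # A User-agent line after rules starts a new group.
                current_agents = []
                seen_rule = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def select_group(groups, bot_name):
    """Prefer the group naming the bot; otherwise fall back to '*'."""
    name = bot_name.lower()
    if name in groups:
        return groups[name]
    return groups.get("*", [])


ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: meanpathbot
Disallow: /
"""

if __name__ == "__main__":
    groups = parse_groups(ROBOTS_TXT)
    # The bot-specific group wins even though the generic one comes first.
    print(select_group(groups, "meanpathbot"))  # [('disallow', '/')]
    print(select_group(groups, "otherbot"))     # [('allow', '/')]
```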
Sorry, my response had a typo in it and was not well worded. What I was trying to say was that if you have two directives specific to meanpathbot with conflicting disallow and allow statements, we take the first one found. We often find multiple conflicting directives, especially on sites which have dynamic bot blockers that add things to robots.txt on the fly. Meanpathbot should act the same as Googlebot, so if you see it acting differently, let us know so we can work out why.
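As a rough illustration of "take the first one found" within a single bot's group, the sketch below resolves conflicting allow/disallow lines by file order. For contrast it also shows longest-path precedence, which is what Google documents for Googlebot, so the two approaches can disagree on the same file. The function names and the rule list are made up for the example and are not meanpath's implementation.

```python
def first_match_allowed(rules, path):
    """First rule whose path prefix matches decides; no match means allowed."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "allow"
    return True


def longest_match_allowed(rules, path):
    """Longest matching prefix decides; on a tie, allow wins."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


# Hypothetical conflicting rules inside one bot-specific group.
RULES = [("disallow", "/"), ("allow", "/public/")]

if __name__ == "__main__":
    print(first_match_allowed(RULES, "/public/page.html"))    # False
    print(longest_match_allowed(RULES, "/public/page.html"))  # True
```

With first-match, the order of lines in the file matters, so a bot blocker that appends rules on the fly can silently change the outcome.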
A "User-agent: *" allow is actually redundant, as all bots will assume they are allowed unless there is a specific rule for them.
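The redundancy is easy to check with Python's standard-library urllib.robotparser: an empty robots.txt and one containing only a generic allow give the same answer. The helper name allowed and the sample path are placeholders, and this shows only how that one parser behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, path):
    """Return True if this robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


EMPTY = ""
EXPLICIT_ALLOW = "User-agent: *\nAllow: /\n"

if __name__ == "__main__":
    # Both files permit the same request.
    print(allowed(EMPTY, "meanpathbot", "/page.html"))           # True
    print(allowed(EXPLICIT_ALLOW, "meanpathbot", "/page.html"))  # True
```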
Our robots code is actually open source and in use by a few crawlers: https://github.com/meanpath/robots