
Forum Moderators: Ocean10000 & incrediBILL

meanpath

   
12:38 am on Sep 19, 2013 (gmt 0)

lucy24, WebmasterWorld Senior Member



Anyone know anything about meanpath dot com and/or the meanpathbot? Forums search is entirely silent :(

Its most recent crawl was from an OVH range, meaning it was a priori blocked except for robots.txt. It did ask for robots.txt -- but it then went on to ask for the front page of my test site, which is 100% roboted-out. As in:

"Which part of
User-Agent: *
Disallow: /
did you not understand?"

I don't know if it would try to dig deeper, if permitted. Equally important, I can't tell if it's serving any good and worthy purpose. This is assuming for the sake of discussion that a legitimate new search engine is a "good and worthy" thing. ymmv on this point. But here I'm not sure of its legitimacy in the first place.
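For reference, a minimal sketch of what that two-line robots.txt means to a compliant parser, using Python's standard urllib.robotparser. The host example.com is a placeholder, since the test site is not named in the post.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted in the post: every crawler, every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a stand-in for the (unnamed) test site.
front_page_ok = is_allowed(ROBOTS_TXT, "meanpathbot", "http://example.com/")
print(front_page_ok)  # prints False: the front page is off limits
```

Under this file no user agent has permission to request anything beyond robots.txt itself, the front page included.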
5:26 am on Sep 19, 2013 (gmt 0)



On their blog I see postings like 'Twitter Bootstrap Now Powering 1% of The Web'. So it looks like a service similar to builtwith.
12:24 am on Sep 20, 2013 (gmt 0)



Hi Lucy24,

I am the CEO of meanpath, Inc. meanpathbot should respect your robots.txt, so if you can email me the domain of the site it tried to crawl without respecting your robots.txt, I can get the team to look into it: adam@meanpath.com or support@meanpath.com

One common error we see with robots.txt is that webmasters are not aware that robots.txt will be read from top to bottom, with the first applicable rule found being followed. So if you had, say, a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below, it would follow the allow at the top, not the disallow at the bottom.

This may not be the case here but we can easily figure out what the issue is once we do a test on your site.
4:00 am on Sep 20, 2013 (gmt 0)

WebmasterWorld Administrator



It has been crawling around a few of my sites since late June or so. Not to my benefit, they have been uninvited.
5:05 pm on Sep 20, 2013 (gmt 0)

WebmasterWorld Senior Member



"So if you had, say, a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below, it would follow the allow at the top, not the disallow at the bottom."

Why would you not follow the most specific directive for your bot [like at least Google and Bing do] rather than simply using the first one found, since a "generic" directive is likely to be the first directive in the file?
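As an illustration of "most specific directive wins", here is a small sketch using Python's standard urllib.robotparser, which consults a group naming the bot before falling back to the `*` group, wherever each sits in the file. The two robots.txt texts, the "otherbot" agent and the example.com URL are made up for the demonstration; nothing here is meanpath's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with the generic group first, as described in the post.
GENERIC_FIRST = """\
User-agent: *
Allow: /

User-agent: meanpathbot
Disallow: /
"""

# The same two groups in the opposite order.
SPECIFIC_FIRST = """\
User-agent: meanpathbot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "http://example.com/page.html") -> bool:
    """Return True if user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for label, text in (("generic first", GENERIC_FIRST),
                    ("specific first", SPECIFIC_FIRST)):
    # The named group decides for meanpathbot; other bots fall back to "*".
    print(label, allowed(text, "meanpathbot"), allowed(text, "otherbot"))
```

With this parser the order of the groups makes no difference: meanpathbot is refused in both files and the unnamed bot is admitted in both.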
8:11 pm on Sep 20, 2013 (gmt 0)

keyplyr, WebmasterWorld Senior Member






"So if you had, say, a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below, it would follow the allow at the top, not the disallow at the bottom."

That statement just got that UA and its respective IP range blocked across the sites I manage.
8:24 pm on Sep 20, 2013 (gmt 0)

WebmasterWorld Senior Member



Oh, yeah, I didn't mention I snap-blocked them too after I read that.
9:17 pm on Sep 20, 2013 (gmt 0)



Sorry, my response had a typo in it and was not well worded. What I was trying to say was that if you have two directives specific to meanpathbot with conflicting disallow and allow statements, we take the first one found. We often find multiple conflicting directives, especially on sites which have dynamic bot blockers that add things to robots.txt on the fly. Meanpathbot should act the same as Googlebot, so if you see it acting differently, let us know so we can work out why.

A "User-agent: *" allow is actually redundant, as all bots will assume they are allowed unless there is a specific rule for them.

[github.com...] our robots code is actually open source and in use by a few crawlers.
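To make the two policies in play concrete, here is a rough sketch (not taken from the linked repository) of how conflicting rules inside a single bot's group can be resolved: "first one found" as described above, versus the longest-match rule that Google documents for Googlebot, where an allow wins a tie. The Rule type, both function names and the sample rules are hypothetical; wildcards are ignored and empty paths are skipped. Both functions default to "allowed" when no rule applies.

```python
from typing import List, Optional, Tuple

# A rule is (is_allow, path_prefix); hypothetical simplified representation.
Rule = Tuple[bool, str]


def first_match(rules: List[Rule], path: str) -> bool:
    """First applicable rule in file order decides."""
    for is_allow, prefix in rules:
        if prefix and path.startswith(prefix):
            return is_allow
    return True  # no applicable rule: allowed by default


def longest_match(rules: List[Rule], path: str) -> bool:
    """Longest applicable prefix decides; an allow wins an equal-length tie."""
    best: Optional[Rule] = None
    for is_allow, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        if (best is None
                or len(prefix) > len(best[1])
                or (len(prefix) == len(best[1]) and is_allow)):
            best = (is_allow, prefix)
    return True if best is None else best[0]


# Conflicting rules for one bot: block everything, then allow /public/.
RULES: List[Rule] = [(False, "/"), (True, "/public/")]

print(first_match(RULES, "/public/page.html"))    # False: "Disallow: /" came first
print(longest_match(RULES, "/public/page.html"))  # True: "/public/" is more specific
print(first_match([], "/anything"))               # True: no rules means allowed
```

The two policies agree whenever a group has no overlapping rules, so a crawler can differ from Googlebot only on files that actually contain such conflicts.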
9:37 pm on Sep 20, 2013 (gmt 0)

lucy24, WebmasterWorld Senior Member



Overlapping again! In response to keyplyr and jd:

I read yesterday's response, said WHAT THE ###, composed a long reply... and deleted it on the grounds of non-responsiveness.

My robots.txt files do happen to list * as the very last record. But that's coding style, not the robots.txt standard.

Well, it's all academic since we're talking about an OVH range.
6:42 am on Sep 21, 2013 (gmt 0)

keyplyr, WebmasterWorld Senior Member







"Well, it's all academic since we're talking about an OVH range."

True... but I couldn't resist being dramatic :)
 
