
Forum Moderators: Ocean10000 & incrediBILL


meanpath

     
12:38 am on Sep 19, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12702
votes: 244


Anyone know anything about meanpath dot com and/or the meanpathbot? Forums search is entirely silent :(

Its most recent crawl was from an OVH range, meaning it was a priori blocked except for robots.txt. It did ask -- but it went on to ask for the front page of my test site, which is 100% roboted-out. As in:

"Which part of
User-Agent: *
Disallow: /
did you not understand?"

I don't know if it would try to dig deeper, if permitted. Equally important, I can't tell if it's serving any good and worthy purpose. This is assuming for the sake of discussion that a legitimate new search engine is a "good and worthy" thing. ymmv on this point. But here I'm not sure of its legitimacy in the first place.
5:26 am on Sept 19, 2013 (gmt 0)

Full Member

5+ Year Member

joined:Aug 16, 2010
posts:214
votes: 11


On their blog I see postings like 'Twitter Bootstrap Now Powering 1% of The Web'. So it looks like a service similar to builtwith.
12:24 am on Sept 20, 2013 (gmt 0)

New User

joined:Sept 20, 2013
posts: 2
votes: 0


Hi Lucy24,

I am the CEO of meanpath, Inc. meanpathbot should respect your robots.txt, so if you can email me the domain of the site it tried to crawl without respecting your robots.txt, I can get the team to look into it: adam@meanpath.com or support@meanpath.com

One common error we see with robots.txt is that webmasters are not aware that robots.txt is read from top to bottom, and the first applicable rule found is the one followed. So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

This may not be the case here but we can easily figure out what the issue is once we do a test on your site.
4:00 am on Sept 20, 2013 (gmt 0)

Moderator from US 

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:2563
votes: 48


It has been crawling around a few of my sites since late June or so. Not to my benefit, they have been uninvited.
5:05 pm on Sept 20, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

Why would you not follow the most specific directive for your bot [like at least Google and Bing do] rather than simply using the first found since a "generic" directive is likely the first directive in the file?
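For comparison, here is a minimal sketch of "most specific record wins" using Python's standard-library urllib.robotparser. It is only a stand-in: it is not meanpath's code, and the robots.txt content, the agent name "someotherbot" and the example.com URL are made up for illustration. This parser checks records that name a bot before it falls back to the * record, so the order of records in the file does not matter.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: generic allow at the top, bot-specific disallow below.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: meanpathbot
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The record naming the bot is consulted first, even though it comes second in the file.
specific_allowed = parser.can_fetch("meanpathbot", "http://example.com/")

# A bot with no record of its own falls back to the * record.
generic_allowed = parser.can_fetch("someotherbot", "http://example.com/")

print(specific_allowed, generic_allowed)
```

With this parser the named bot is refused and the unnamed one is allowed, which is the behaviour the question above is asking for.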
8:11 pm on Sept 20, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5805
votes: 64





So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

That statement just got that UA and its respective IP range blocked across the sites I manage.
8:24 pm on Sept 20, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


Oh, yeah, I didn't mention I snap-blocked them too after I read that.
9:17 pm on Sept 20, 2013 (gmt 0)

New User

joined:Sept 20, 2013
posts: 2
votes: 0


Sorry, my response had a typo in it and was not well worded. What I was trying to say was that if you have two directives specific to meanpathbot with conflicting disallow and allow statements, we take the first one found. We often find multiple conflicting directives, especially on sites which have dynamic bot blockers that add things to robots.txt on the fly. Meanpathbot should act the same as Googlebot, so if you see it acting differently, let us know so we can work out why.

User agent * allow is actually redundant as all bots will assume they are allowed unless there is a specific rule for them.

Our robots code is actually open source and in use by a few crawlers: [github.com...]
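The "first one found" handling of conflicting records can be sketched with Python's standard-library urllib.robotparser, which happens to resolve duplicates the same way. This is an illustration under assumptions, not the code linked above: the two robots.txt bodies and the example.com URL are hypothetical, and other crawlers may resolve such conflicts differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return whether `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Two conflicting records for the same bot, as a dynamic bot blocker might
# produce by appending to the file (hypothetical content).
DISALLOW_FIRST = (
    "User-agent: meanpathbot\nDisallow: /\n\n"
    "User-agent: meanpathbot\nAllow: /\n"
)
ALLOW_FIRST = (
    "User-agent: meanpathbot\nAllow: /\n\n"
    "User-agent: meanpathbot\nDisallow: /\n"
)

URL = "http://example.com/page.html"

# This parser stops at the first record that names the agent.
disallow_first_result = allowed(DISALLOW_FIRST, "meanpathbot", URL)
allow_first_result = allowed(ALLOW_FIRST, "meanpathbot", URL)

print(disallow_first_result, allow_first_result)
```

Swapping the order of the two records flips the answer, so a site that wants a bot kept out should make sure no earlier record for the same bot says otherwise.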
9:37 pm on Sept 20, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12702
votes: 244


Overlapping again! In response to keyplyr and jd:

I read yesterday's response, said WHAT THE ###, composed a long reply... and deleted it on the grounds of non-responsiveness.

My robots.txt files do happen to list * as the very last record. But that's coding style, not the robots.txt standard.

Well, it's all academic since we're talking about an OVH range.
6:42 am on Sept 21, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5805
votes: 64






Well, it's all academic since we're talking about an OVH range.

True... but I couldn't resist being dramatic :)