
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
meanpath
lucy24




msg:4610820
 12:38 am on Sep 19, 2013 (gmt 0)

Anyone know anything about meanpath dot com and/or the meanpathbot? Forums search is entirely silent :(

Its most recent crawl was from an OVH range, meaning it was a priori blocked except for robots.txt. It did ask -- but it went on to ask for the front page of my test site, which is 100% roboted-out. As in:

"Which part of
User-Agent: *
Disallow: /
did you not understand?"

I don't know if it would try to dig deeper, if permitted. Equally important, I can't tell if it's serving any good and worthy purpose. This is assuming for the sake of discussion that a legitimate new search engine is a "good and worthy" thing. ymmv on this point. But here I'm not sure of its legitimacy in the first place.
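Editor's note: for anyone who wants to check what a blanket disallow is expected to mean, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URL are illustrative assumptions, not anything from the site or the bot discussed here, and the standard-library parser is only one interpretation of robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted in the post above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: ask the stdlib parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The front page is covered by "Disallow: /", so a compliant bot should skip it.
    print(is_allowed(ROBOTS_TXT, "meanpathbot", "http://example.com/"))
```

Under this parser, the front page request described above is one that a compliant crawler would not make.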

 

bhukkel




msg:4610861
 5:26 am on Sep 19, 2013 (gmt 0)

On their blog I see postings like 'Twitter Bootstrap Now Powering 1% of The Web'. So it looks like a service similar to builtwith.

adamseabrook




msg:4611118
 12:24 am on Sep 20, 2013 (gmt 0)

Hi Lucy24,

I am the CEO of meanpath, Inc. meanpathbot should respect your robots.txt so if you can email me the domain of the site it tried to crawl without respecting your robots.txt I can get the team to look into it. adam@meanpath.com or support@meanpath.com

One common error we see with robots.txt is that webmasters are not aware that robots.txt is read from top to bottom, with the first applicable rule found being followed. So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

This may not be the case here but we can easily figure out what the issue is once we do a test on your site.

not2easy




msg:4611155
 4:00 am on Sep 20, 2013 (gmt 0)

It has been crawling around a few of my sites since late June or so. Not to my benefit, they have been uninvited.

JD_Toims




msg:4611366
 5:05 pm on Sep 20, 2013 (gmt 0)

So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

Why would you not follow the most specific directive for your bot [like at least Google and Bing do] rather than simply using the first found since a "generic" directive is likely the first directive in the file?
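Editor's note: the difference between the two strategies in this question can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code of meanpathbot or any search engine: the function names (parse_groups, first_applicable_group, most_specific_group) are made up for this example, matching is a plain case-insensitive substring test, and only User-agent, Allow, and Disallow lines are handled.

```python
SAMPLE = """\
User-agent: *
Allow: /

User-agent: meanpathbot
Disallow: /
"""


def parse_groups(robots_txt):
    """Return a list of (agents, rules) groups in file order."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def first_applicable_group(groups, bot):
    """Strategy A: take the first group that applies, wildcard included."""
    bot = bot.lower()
    for agents, rules in groups:
        if any(a == "*" or (a and a in bot) for a in agents):
            return rules
    return []


def most_specific_group(groups, bot):
    """Strategy B: prefer the longest matching agent name; '*' is the fallback."""
    bot = bot.lower()
    best, best_len = [], -1
    for agents, rules in groups:
        for a in agents:
            if a == "*":
                length = 0
            elif a and a in bot:
                length = len(a)
            else:
                continue
            if length > best_len:
                best, best_len = rules, length
    return best


if __name__ == "__main__":
    groups = parse_groups(SAMPLE)
    print(first_applicable_group(groups, "meanpathbot"))
    print(most_specific_group(groups, "meanpathbot"))
```

With the sample file, strategy A picks the wildcard group's Allow and strategy B picks the bot's own Disallow, which is exactly the disagreement raised in this post.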

keyplyr




msg:4611413
 8:11 pm on Sep 20, 2013 (gmt 0)




So if you had say a permission for all crawlers to crawl at the top and a specific one for meanpathbot saying no below it would follow the allow at the top not the disallow at the bottom.

That statement just got that UA and its respective IP range blocked across the sites I manage.

JD_Toims




msg:4611419
 8:24 pm on Sep 20, 2013 (gmt 0)

Oh, yeah, I didn't mention I snap-blocked them too after I read that.

adamseabrook




msg:4611441
 9:17 pm on Sep 20, 2013 (gmt 0)

Sorry, my response had a typo in it and was not well worded. What I was trying to say was that if you have two directives specific to meanpathbot with conflicting disallow and allow statements, we take the first one found. We often find multiple conflicting directives, especially on sites which have dynamic bot blockers that add things to robots.txt on the fly. Meanpathbot should act the same as Googlebot, so if you see it acting differently let us know so we can work out why.

User agent * allow is actually redundant as all bots will assume they are allowed unless there is a specific rule for them.

https://github.com/meanpath/robots our robots code is actually open source and in use by a few crawlers.
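Editor's note: the "conflicting allow and disallow inside one group" case can be resolved in more than one way, and the choice matters. The sketch below contrasts a first-match rule with a longest-path rule in which Allow wins ties. It is a hypothetical illustration only: it is not taken from the repository linked above, the function names are invented, and rules are treated as plain path prefixes with no wildcard support.

```python
# One group for a single bot, with conflicting rules in file order.
RULES = [
    ("disallow", "/"),
    ("allow", "/public/"),
]


def first_match(rules, path):
    """Return True if path may be fetched, using the first rule whose prefix matches."""
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched: allowed by default


def longest_match(rules, path):
    """Return True if path may be fetched, using the longest matching prefix.

    On equal prefix length, "allow" wins (an assumption of this sketch).
    """
    best_kind, best_len = "allow", -1
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_allow:
                best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"


if __name__ == "__main__":
    for path in ("/public/page.html", "/private/page.html"):
        print(path, first_match(RULES, path), longest_match(RULES, path))
```

The two strategies agree on /private/page.html but disagree on /public/page.html, so a crawler and a webmaster who assume different strategies can each believe the other is misreading the same file.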

lucy24




msg:4611449
 9:37 pm on Sep 20, 2013 (gmt 0)

Overlapping again! In response to keyplyr and jd:

I read yesterday's response, said WHAT THE ###, composed a long reply... and deleted it on the grounds of non-responsiveness.

My robots.txt files do happen to list * as the very last record. But that's coding style, not the robots.txt standard.

Well, it's all academic since we're talking about an OVH range.

keyplyr




msg:4611537
 6:42 am on Sep 21, 2013 (gmt 0)





Well, it's all academic since we're talking about an OVH range.

True... but I couldn't resist being dramatic :)
