

Search Engine Spider and User Agent Identification Forum

    
bebopbot
lucy24




 
Msg#: 4678867 posted 8:52 pm on Jun 10, 2014 (gmt 0)

Anyone met this guy?

Mozilla/5.0 (compatible; Linux x86_64; BebopBot/2.5.1; +http://www.apassion4jazz.net/bebopbot.html)

According to their www page, possible IPs are
50.63.211.1, 70.179.4.113, 97.74.140.17, 97.74.144.120, 173.636.184.241 [sic]
Currently it's the 70.179 one.

Why they are now blocked:
70.179.4.113 - - [07/Jun/2014:19:09:45 -0700] "GET /dirname/pagename.html HTTP/1.1" 200 8695 "http://www.webmasterworld.com/profilev4.cgi?action=view&member=lucy24" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /sharedstyles.css HTTP/1.1" 200 6346 "http://example.com/dirname/pagename.html" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/miststyles.css HTTP/1.1" 200 2785 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/fun-icon.png HTTP/1.1" 200 859 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/penguin.png HTTP/1.1" 200 1512 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/headers/header_beenthere.png HTTP/1.1" 200 1444 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/robot.png HTTP/1.1" 200 4506 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/panda.png HTTP/1.1" 200 2256 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/hummingbird.png HTTP/1.1" 200 4570 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /fun/images/collage_robot.png HTTP/1.1" 200 149084 "{ same }" "{ same }"
70.179.4.113 - - [07/Jun/2014:19:09:48 -0700] "GET /favicon.ico HTTP/1.1" 200 661 "-" "{ same }"

Quoted in full to illustrate perfect humanoid behavior, with all supporting files except js (on this page used only by piwik). Next comes:

70.179.4.113 - - [07/Jun/2014:19:10:24 -0700] "GET /dirname/pagename.html HTTP/1.1" 200 8695 "-" "Mozilla/5.0 (compatible; Linux x86_64; BebopBot/2.5.1; +http://www.example.net/bebopbot.html)"
70.179.4.113 - - [07/Jun/2014:19:10:25 -0700] "GET /sharedstyles.css HTTP/1.1" 304 237 "-" "{ bebopbot }"
70.179.4.113 - - [07/Jun/2014:19:10:25 -0700] "GET /fun/miststyles.css HTTP/1.1" 304 237 "-" "{ bebopbot }"
70.179.4.113 - - [07/Jun/2014:19:10:25 -0700] "GET /fun/images/fun-icon.png HTTP/1.1" 304 237 "-" "{ bebopbot }"

(et cetera, as above, each with 304 and no referer) followed shortly afterward by
70.179.4.113 - - [07/Jun/2014:19:10:47 -0700] "GET /robots.txt HTTP/1.1" 200 635 "-" "{ bebopbot }"
There were a few subsequent page requests, each with accompanying css and images.

Now, my host is occasionally a bit hiccupy in logs-- but twenty-three seconds (time from first page request to first robots.txt request with this UA)? Nuh-uh.
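For anyone who wants to measure that gap in their own logs, here is a rough Python sketch. It assumes Apache combined log format like the excerpts above; the regex and the function name `robots_delay` are made up for illustration, and the "{ bebopbot }" shorthand is expanded back to the full UA string in the sample lines.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format, as in the excerpts above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

def robots_delay(lines, ua_substring):
    """Seconds from each IP's first request under a UA to its first
    /robots.txt request under that same UA (hypothetical helper)."""
    first_seen = {}
    delays = {}
    for line in lines:
        rec = parse(line)
        if rec is None or ua_substring not in rec["ua"]:
            continue
        ip = rec["ip"]
        first_seen.setdefault(ip, rec["ts"])
        if rec["path"] == "/robots.txt" and ip not in delays:
            delays[ip] = (rec["ts"] - first_seen[ip]).total_seconds()
    return delays

BOT_UA = ("Mozilla/5.0 (compatible; Linux x86_64; BebopBot/2.5.1; "
          "+http://www.example.net/bebopbot.html)")
SAMPLE = [
    '70.179.4.113 - - [07/Jun/2014:19:10:24 -0700] '
    '"GET /dirname/pagename.html HTTP/1.1" 200 8695 "-" "' + BOT_UA + '"',
    '70.179.4.113 - - [07/Jun/2014:19:10:47 -0700] '
    '"GET /robots.txt HTTP/1.1" 200 635 "-" "' + BOT_UA + '"',
]

print(robots_delay(SAMPLE, "BebopBot"))  # {'70.179.4.113': 23.0}
```

A robot that reads robots.txt before anything else shows a delay of 0; anything above that means pages were fetched first.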

Notice all those 304s? The robot's first page request-- the page that had previously been seen by the human(oid) UA-- came with
Cache-Control: max-age=0
The later page requests left it out. (Weird choice, btw. Search engines usually say no-cache; in fact the most common "max-age=0" is from Camino when I've explicitly refreshed a page.) I don't log headers for non-page requests, but apparently the robot wasn't as concerned with verisimilitude for those.
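Those 304s can be hunted for mechanically. A server only answers 304 to a conditional request (If-Modified-Since or If-None-Match), so a UA whose very first logged request for a file comes back 304 must have picked up its cached copy under some other identity. The Python sketch below flags that case. It assumes combined log format; `inherited_cache_hits` is a made-up name, and a log that starts mid-session will produce false positives, so treat hits as leads rather than proof.

```python
import re

# Assumed layout: Apache "combined" log format, as in the excerpts above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def inherited_cache_hits(lines):
    """Return (ip, ua, path) triples whose first logged request got a 304."""
    seen = set()
    flagged = []
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        key = (m["ip"], m["ua"], m["path"])
        if key not in seen:
            seen.add(key)
            # First time this IP+UA asks for this path, yet it already
            # holds a validator: the cache was filled under another UA.
            if m["status"] == "304":
                flagged.append(key)
    return flagged

# UA strings as quoted in the log excerpts ("{ same }" / "{ bebopbot }" expanded).
FIREFOX = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
BOT = ("Mozilla/5.0 (compatible; Linux x86_64; BebopBot/2.5.1; "
       "+http://www.example.net/bebopbot.html)")
SAMPLE = [
    f'70.179.4.113 - - [07/Jun/2014:19:09:46 -0700] "GET /sharedstyles.css HTTP/1.1" '
    f'200 6346 "http://example.com/dirname/pagename.html" "{FIREFOX}"',
    f'70.179.4.113 - - [07/Jun/2014:19:10:24 -0700] "GET /dirname/pagename.html HTTP/1.1" '
    f'200 8695 "-" "{BOT}"',
    f'70.179.4.113 - - [07/Jun/2014:19:10:25 -0700] "GET /sharedstyles.css HTTP/1.1" '
    f'304 237 "-" "{BOT}"',
]

for ip, ua, path in inherited_cache_hits(SAMPLE):
    print(ip, path)
```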

This annoys me.

 

keyplyr




 
Msg#: 4678867 posted 4:53 pm on Jun 12, 2014 (gmt 0)



Did it crawl files disallowed by robots.txt? Did it request files too rapidly? Cause any server issues? Sorry, I don't see the problem here.

Bots often cache robots.txt for varying lengths of time. Many fetch it on a different day, from a different IP, or even with a different UA or GET tool than the one doing the crawling. I see it all the time.

lucy24




 
Msg#: 4678867 posted 6:43 pm on Jun 12, 2014 (gmt 0)

"Sorry, I don't see the problem here."

This does not surprise me. But if we're going to play games, here are the numbers.

Requests for robots.txt from all sources in the week preceding the robot's first visit: 149, including 6 redirects
From googlebot: 12
From bingbot: 12
From msnbot-media: 50 (this explains the unnaturally low bing number, heh heh)
From Mail.RU_bot: 38
From Yandexbot: 11
From MJ12bot: 10 (poor thing! If only it would stop crawling from blocked IP ranges, it would see a lot more pages)
From Seznambot: 5
From Exabot: 3
From assorted other named robots (one or two requests each): 8
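A tally like the one above can be pulled from raw logs with a few lines of Python. This is only a sketch: it assumes combined log format, the function name `robots_requests_by_agent` is made up, and the sample lines are invented (documentation-range IPs, one hypothetical UA) purely to show the shape of the output. They are not from the logs discussed in this thread.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def robots_requests_by_agent(lines, names):
    """Count /robots.txt requests per robot name (case-insensitive
    substring match on the UA); unmatched agents go under 'other'."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m["path"] != "/robots.txt":
            continue
        ua = m["ua"].lower()
        label = next((n for n in names if n.lower() in ua), "other")
        counts[label] += 1
    return counts

NAMES = ["googlebot", "bingbot", "msnbot-media", "Mail.RU_bot",
         "Yandexbot", "MJ12bot", "Seznambot", "Exabot"]

# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [01/Jun/2014:00:00:01 -0700] "GET /robots.txt HTTP/1.1" 200 635 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jun/2014:06:00:01 -0700] "GET /robots.txt HTTP/1.1" 200 635 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [01/Jun/2014:07:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 635 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.3 - - [01/Jun/2014:08:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 635 "-" '
    '"ExampleFetcher/1.0"',
    '192.0.2.3 - - [01/Jun/2014:08:00:01 -0700] "GET /index.html HTTP/1.1" 200 1234 "-" '
    '"ExampleFetcher/1.0"',
]

for name, n in robots_requests_by_agent(SAMPLE, NAMES).most_common():
    print(f"From {name}: {n}")
```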
