Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
oBot robot where art thou?
lucy24




msg:4579206
 9:12 pm on May 29, 2013 (gmt 0)

Continuing from where we left off [webmasterworld.com]:

Consider this pattern:

robot asks for and receives robots.txt
robot asks for and receives front page
robot asks for all images associated with front page-- or rather, all images that were associated with the front page at some past date, resulting in a mix of 200 and 404 (I rarely bother to 301 or 410 images)

next visit:

robot asks for and receives robots.txt
robot asks for and receives front page
robot asks for all images that were associated with the front page on that prior visit

And your point is...?

Perfectly normal robotic behavior, right? It's how an ordinary search engine, for example, constructs its shopping list.

Except that in the case of the oBot, those two visits were over sixteen months apart. So long, in fact, that somewhere along the line I must have unblocked it as part of routine housekeeping. Well, if it's only going to come by every year and a half, it probably isn't worth the bother of blocking it. Even if last year I did find plenty of good and valid reasons to lock the door.

But really. Asking for files that you put on your shopping list over a year ago seems a bit futile. Perhaps they're going for some kind of negative speed record.

:: people in a certain age bracket may insert IBM wisecrack ad lib. ::

 

keyplyr




msg:4579601
 7:35 pm on May 30, 2013 (gmt 0)

Looks like I've had this agent disallowed for about 6 years:

User-agent: oBot
Disallow: /

Can't remember why, just noted as "bad bots"

AFAIK it has never disobeyed robots.txt.
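For anyone wanting to double-check what a rule like the one above actually blocks, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed` and the second user-agent string are made up for illustration; only the two-line rule comes from the post.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the post above.
ROBOTS_TXT = """\
User-agent: oBot
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "oBot", "/"))
    # "ExampleCrawler" is a hypothetical agent with no rule of its own.
    print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "/index.html"))
```

This only says what the file asks for. Whether a given robot honors it still has to be read out of the access logs.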

lucy24




msg:4579612
 8:09 pm on May 30, 2013 (gmt 0)

There was a follow-up visit a few hours later, with a repeat HEAD request for the specific files that were 404'd earlier. But not the files currently linked from the html page. If I leave the robot unblocked, I will expect to see it some time in 2014 asking for this year's images. (Possibly also the stylesheet. I can't remember if I had one for that specific page in 2012.)

After rereading my post from last year, I went back and looked up my record of the oBot's January 2012 visit. That time it asked for at least one image file in a directory that was definitely roboted-out at the time. :: Hasty edit as I belatedly figure out what it was doing there ::

I've noticed before now that robots may get an unearned reputation for good behavior. Nothing that links directly from my home page is in a roboted-out directory. So if robots go no deeper-- most don't-- there is no opportunity to disobey robots.txt.
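The distinction between "obeyed robots.txt" and "never had a chance to disobey" could be sketched roughly like this. Everything here is hypothetical: the function names, the prefix-only matching, and the idea that you know which links the robot was offered are simplifying assumptions, not a description of anyone's actual log analysis.

```python
def is_disallowed(path, disallowed_prefixes):
    """Simplified check: a path is off limits if it starts with any prefix."""
    return any(path.startswith(prefix) for prefix in disallowed_prefixes)

def classify_compliance(hits, disallowed_prefixes, offered_links):
    """Label each agent 'violated', 'obeyed', or 'untested'.

    hits: iterable of (agent, path) pairs taken from the access log.
    offered_links: links found on the pages robots actually fetched
    (assumed known here; in practice this takes some digging).
    """
    # If nothing the robot saw points into a disallowed directory,
    # staying out of those directories proves nothing.
    tempted = any(is_disallowed(link, disallowed_prefixes) for link in offered_links)
    verdicts = {}
    for agent, path in hits:
        if is_disallowed(path, disallowed_prefixes):
            verdicts[agent] = "violated"
        elif verdicts.get(agent) != "violated":
            verdicts[agent] = "obeyed" if tempted else "untested"
    return verdicts

if __name__ == "__main__":
    log = [("oBot", "/robots.txt"), ("oBot", "/"), ("oBot", "/images/logo.png")]
    print(classify_compliance(log, ["/private/"], ["/images/logo.png", "/about.html"]))
```

With a front page that links nowhere into the disallowed directories, every well-behaved-looking robot comes out as "untested" rather than "obeyed".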

keyplyr




msg:4579628
 8:56 pm on May 30, 2013 (gmt 0)


Your experience may differ, but most SE bots that hit my server get web pages alone. They, or their sister bots, may return and get image files, but almost never get pages and images on the same visit.

So my assumption is that any bot doing both during the same visit would not be a SE bot, but have a different agenda. Whether that is beneficial to one's interest or not would vary from site to site.
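A rough sketch of that heuristic, for anyone who wants to run it against their own logs. The 30-minute gap that defines a "visit", the extension lists, and the function names are all assumptions picked for illustration; real logs would need proper parsing first.

```python
import posixpath

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif"}
VISIT_GAP_SECONDS = 30 * 60  # assumed: a longer silence starts a new visit

def kind(path):
    """Crudely classify a request path as robots, image, page, or other."""
    if path == "/robots.txt":
        return "robots"
    ext = posixpath.splitext(path)[1].lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in ("", ".html", ".htm"):
        return "page"
    return "other"

def agents_mixing_pages_and_images(hits):
    """hits: iterable of (agent, unix_timestamp, path).

    Returns a sorted list of agents that fetched at least one page and
    at least one image within a single visit.
    """
    visits_by_agent = {}
    for agent, ts, path in sorted(hits, key=lambda h: (h[0], h[1])):
        visits = visits_by_agent.setdefault(agent, [])
        if not visits or ts - visits[-1]["last"] > VISIT_GAP_SECONDS:
            visits.append({"last": ts, "kinds": set()})
        visits[-1]["last"] = ts
        visits[-1]["kinds"].add(kind(path))
    return sorted(
        agent
        for agent, visits in visits_by_agent.items()
        if any({"page", "image"} <= v["kinds"] for v in visits)
    )

if __name__ == "__main__":
    sample = [
        ("oBot", 0, "/robots.txt"), ("oBot", 5, "/"), ("oBot", 10, "/images/a.png"),
        ("SomeSEBot", 0, "/"), ("SomeSEBot", 100000, "/images/a.png"),
    ]
    print(agents_mixing_pages_and_images(sample))
```

An agent that shows up in the result is not necessarily hostile; as noted, whether its agenda suits you varies from site to site.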

That said, I tend to block almost all bots except the top several SE bots, then just allow a few others as they show a benefit for my site.

lucy24




msg:4579631
 9:16 pm on May 30, 2013 (gmt 0)

The oBot is definitely not a search engine; "different agenda" sums it up pretty well. Going by the information I dredged up last year (thread linked in first post) they seem to be involved in some pretty unappetizing filtering.

The noteworthy feature of the oBot is that its image-file requests aren't based on what is linked from the html page right now. It's what was linked from the page on its previous visit-- even though that previous visit was over a year ago.

Independent of motive or agenda, that just seems inexplicable. It's like going out to buy something based on information in last year's Yellow Pages when you have the current version available.
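One way to confirm the stale-shopping-list behavior is to compare what the robot asked for against two link lists: the images the page links to now, and the images it linked to at the time of the earlier visit. The sketch below assumes you still have (or can reconstruct) the old list; the function name and every path in the example are invented for illustration.

```python
def shopping_list_source(requested, current_links, previous_links):
    """Sort a robot's image requests by which version of the page explains them.

    All three arguments are iterables of URL paths.
    """
    requested = set(requested)
    current = set(current_links)
    previous = set(previous_links)
    return {
        # Requests that only the old version of the page can explain.
        "stale_only": sorted(requested & (previous - current)),
        # Requests that only the current version of the page can explain.
        "fresh_only": sorted(requested & (current - previous)),
        # Images linked right now that the robot never asked for.
        "ignored_current": sorted(current - requested),
    }

if __name__ == "__main__":
    # Hypothetical paths: the old page deep-linked into a subdirectory,
    # the new page uses a copy in its own /images/ directory.
    report = shopping_list_source(
        requested=["/ebooks/cover.png", "/images/old-banner.gif"],
        current_links=["/images/cover.png"],
        previous_links=["/ebooks/cover.png", "/images/old-banner.gif"],
    )
    for label, paths in report.items():
        print(label, paths)
```

A robot working from the current page would produce an empty "stale_only" list. The pattern described here is the opposite: everything lands in "stale_only" and the current images go unrequested.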

Inevitable follow-up thought: Maybe it archives robots.txt along with the html file. So it's going by the rules of last year's robots.txt, not the current one :)

Confession: I have only now (after 16 months!) figured out why it was asking for one particular image file. It's because my front page used to use deep links to images in the assorted subdirectories; now it has duplicates in its own /images/ directory. D'oh!


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved