

Search Engine Spider and User Agent Identification Forum

    
Human or robot?
another one for the profilers
lucy24




 
Msg#: 4531038 posted 7:51 pm on Dec 25, 2012 (gmt 0)

Would have slipped right under the radar if it hadn't been for that one request for a nonexistent file. I'm inclined to think robot, but I really really don't like robots that act like humans.

Log dump:

199.190.46.141 - - [23/Dec/2012:05:05:42 -0800] "GET /directory/paston/ HTTP/1.1" 200 5134 "http://www.google.co.in/search?q=PASTON+LETTERS.pdf&hl=en {snip, snip} &start=10&sa=N" "JUC (Linux; U; 2.3.6; zh-cn; GT-B5512; 240*320) UCWEB7.9.0.94/139/355"

IP = ChinaCache, agrees with system language; I don't know what UCWeb is but it's got something to do with the IP range
search = Google India
query + startpage = correct (that is, I'm at the top of the 2nd page in Google India using a different browser)
UA = who knows, but size says phone of some kind
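
A rough sketch of how those four observations could be pulled out of the log line mechanically. It assumes Apache's combined log format and assumes Google is showing 10 results per page (so start=10 means page 2). The names LOG_RE and profile are made up for illustration, and the "{snip, snip}" in the referer is left exactly as posted.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Apache "combined" log format; the group names are illustrative labels.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# The log line from the post, with the poster's "{snip, snip}" kept verbatim.
LINE = (
    '199.190.46.141 - - [23/Dec/2012:05:05:42 -0800] '
    '"GET /directory/paston/ HTTP/1.1" 200 5134 '
    '"http://www.google.co.in/search?q=PASTON+LETTERS.pdf&hl=en'
    ' {snip, snip} &start=10&sa=N" '
    '"JUC (Linux; U; 2.3.6; zh-cn; GT-B5512; 240*320) UCWEB7.9.0.94/139/355"'
)

def profile(line):
    """Extract the fields discussed above from one combined-format log line."""
    m = LOG_RE.match(line)
    if m is None:
        raise ValueError("not a combined-format log line")
    ref = urlsplit(m["referer"])
    query = parse_qs(ref.query)
    start = int(query.get("start", ["0"])[0])
    size = re.search(r"(\d+)\*(\d+)", m["ua"])  # e.g. "240*320"
    return {
        "ip": m["ip"],
        "search_host": ref.hostname,
        "search_terms": query.get("q", [""])[0],
        # Assumption: 10 results per page, so start=10 is the top of page 2.
        "results_page": start // 10 + 1,
        "ua_screen": (int(size[1]), int(size[2])) if size else None,
    }

print(profile(LINE))
```

This only restates what the log line already says; it does nothing to settle the human-or-robot question.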

... 05:05:42 ... /pastonstyles.css HTTP/1.1" 200 10893 "http://www.example.com/directory/paston/" ...
... 05:05:42 ... /piwik/piwik.js HTTP/1.1" 200 21928 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/bracket.gif HTTP/1.1" 200 490 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/bracket4.gif HTTP/1.1" 200 518 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/bracket_rt.gif HTTP/1.1" 200 489 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/bracket_tall.gif HTTP/1.1" 200 579 "http://www.example.com/directory/paston/" ...


Yes, the (shared) stylesheet really is bigger than the index page, though some of the size difference is due to (I guess) compression at the server end.

Q: Why did I highlight all those images?
A: Because they are not called by, or even used by, the requested file. They are background images from the stylesheet.

... 05:05:43 ... /images/signature.png HTTP/1.1" 200 3998 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/bracket_tall_rt.gif HTTP/1.1" 200 581 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/sharedtitle.png HTTP/1.1" 200 9825 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/sig120.png HTTP/1.1" 200 2503 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /images/dots.gif HTTP/1.1" 404 912 "http://www.example.com/directory/paston/" ...


Q: Why did they ask for a nonexistent file?
A: Because ::cough-cough:: I haven't got around to cleaning up the stylesheet. This is another background image; it happens not to be used yet-- but might show up in one of the remaining volumes in the set.
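
One way to check for this pattern in a log is to compare the images named in the stylesheet against the images the page really uses, and see which of the leftovers were requested anyway. A minimal sketch: the CSS fragment and the "used by page" set below are invented for illustration (the real stylesheet is not shown in this thread), and the function names are hypothetical.

```python
import re
from urllib.parse import urljoin, urlsplit

# Matches url(foo.gif), url("foo.gif") and url('foo.gif').
URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""")

def stylesheet_images(css_text, css_url):
    """Site paths of every url(...) reference in the stylesheet."""
    return {urlsplit(urljoin(css_url, ref)).path
            for ref in URL_RE.findall(css_text)}

def unused_but_fetched(css_text, css_url, used_by_page, requested):
    """Images listed in the CSS, not used by the page, yet requested."""
    listed = stylesheet_images(css_text, css_url)
    return sorted((listed - set(used_by_page)) & set(requested))

# Hypothetical stylesheet fragment, not the real pastonstyles.css.
CSS = """
div.bracket { background: url(images/bracket.gif) no-repeat; }
p.sig       { background-image: url("images/signature.png"); }
td.dots     { background: url('images/dots.gif') repeat-x; }
"""

CSS_URL = "http://www.example.com/pastonstyles.css"
USED_BY_PAGE = {"/images/signature.png"}  # assumption for the example
REQUESTED = {"/images/bracket.gif", "/images/signature.png", "/images/dots.gif"}

print(unused_but_fetched(CSS, CSS_URL, USED_BY_PAGE, REQUESTED))
# ['/images/bracket.gif', '/images/dots.gif']
```

A non-empty result suggests the visitor fetched everything the stylesheet mentions rather than only what the page needed to render.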

... 05:05:43 ... /images/shield.png HTTP/1.1" 200 2645 "http://www.example.com/directory/paston/" ...
... 05:05:43 ... /favicon.ico HTTP/1.1" 200 662 "-" ...


Robots never ask for the favicon-- except of course for Google's faviconbot, and all those phony SEO sites.

... 05:05:43 ... /piwik/piwik.php?action_name=The%20Paston%20Letters& {snip, snip} &res=800x600 HTTP/1.1" 200 362 "http://www.example.com/directory/paston/" ...


Q: How come the res listed here doesn't match the res given in the UA?
A: I dunno, you tell me.
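
The mismatch itself is easy to flag automatically. A small sketch, assuming the UA carries its size as WIDTH*HEIGHT and the tracker request carries it as res=WIDTHxHEIGHT, as in the lines above; the helper names are made up.

```python
import re

def ua_screen(user_agent):
    """Screen size as claimed in the UA string, e.g. '240*320'."""
    m = re.search(r"(\d+)\*(\d+)", user_agent)
    return (int(m[1]), int(m[2])) if m else None

def tracker_res(request):
    """Screen size as reported in the tracker's res= parameter."""
    m = re.search(r"[?&]res=(\d+)x(\d+)", request)
    return (int(m[1]), int(m[2])) if m else None

UA = "JUC (Linux; U; 2.3.6; zh-cn; GT-B5512; 240*320) UCWEB7.9.0.94/139/355"
# Request path as posted, with the "{snip, snip}" left in place.
HIT = "/piwik/piwik.php?action_name=The%20Paston%20Letters& {snip, snip} &res=800x600"

claimed, reported = ua_screen(UA), tracker_res(HIT)
print(claimed, reported, claimed != reported)
# (240, 320) (800, 600) True
```

It says nothing about why the two differ, only that they do.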

... 05:06:10 ... /zips/paston2.html.zip HTTP/1.1" 200 310750 "http://www.example.com/directory/paston/" ...
... 05:06:10 -0800] "GET /piwik/piwik.php?download {snip, snip} &res=800x600 HTTP/1.1" 200 362 "http://www.example.com/directory/paston/" ...


Note plausibly humanoid time lapse. Note also that they downloaded a zipped html instead of the pdf they originally searched for (and which I do have).
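
For the record, the lapse can be computed straight from the log timestamps. A sketch assuming both requests fall on the same date (the download line's date was trimmed in the dump above, so 23/Dec/2012 is an assumption); gap_seconds is a made-up helper name.

```python
from datetime import datetime

# Timestamp layout used inside the [...] of an Apache access log.
FMT = "%d/%b/%Y:%H:%M:%S %z"

def gap_seconds(first, second):
    """Seconds elapsed between two access-log timestamps."""
    delta = datetime.strptime(second, FMT) - datetime.strptime(first, FMT)
    return delta.total_seconds()

page_request = "23/Dec/2012:05:05:42 -0800"
zip_download = "23/Dec/2012:05:06:10 -0800"  # date assumed, time from the log

print(gap_seconds(page_request, zip_download))  # 28.0
```

Just under half a minute between landing on the page and clicking the download link.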

Where's that "noidea" smiley when you need it?

 

wilderness




 
Msg#: 4531038 posted 11:10 am on Dec 26, 2012 (gmt 0)

Did you try reading right to left? ;)

incrediBILL




 
Msg#: 4531038 posted 8:04 am on Dec 27, 2012 (gmt 0)

"Robots never ask for the favicon"


That would be incorrect.

Those trying to spoof being a browser would blindly download them trying to make it look good. Also, image downloaders and copyright enforcement services download them.

lucy24




 
Msg#: 4531038 posted 9:09 am on Dec 27, 2012 (gmt 0)

"image downloaders and copyright enforcement services"

Not a very well-educated robot, then, since we're talking about the Gairdner edition of the Paston letters :)

I did some experimenting and found that sometimes the original page will come through as referer for background images that are technically called by the style sheet-- but only if, ahem, the page in question actually uses the images. I do think leaving out a referer for the favicon was a master stroke. If only they hadn't come in from ChinaCache...

:: detour to investigate much-overdue question ::

ChinaCache is the leading provider of internet content and application delivery services in China, providing a portfolio of services and solutions to businesses,

Thank you, g###, that's all I needed to know.

:: further detour to change browsers ::

Delivery of web site content into China from foreign sources is a persistent challenge for content owners due to poor interconnection and latency at international gateways as well as unique and complex local rules and regulations. ChinaCache's web site acceleration service using caching and dynamic delivery technologies distributes the static and dynamic content of source web sites to end users in China through our widespread network of in-country CDN nodes.

Uhm...

"It's hard to access web sites from within China because webmasters on the rest of the planet tend to block the country wholesale, so we've set up a caching system using robots based outside of China to let our loyal customers view sites that are no more than a few months out of date."

Did I get that right?

Our extensive network infrastructure and relationships in China well-position us to help North American companies [or "global companies", depending on which text display you're looking at] send efficient, reliable streams into Asia.

Whether they want to or not?

MySpace has been working with ChinaCache since 2007.

Good heavens. I had no idea MySpace was around as recently as that.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved