
Forum Moderators: Ocean10000 & keyplyr

PaperLiBot

     
10:06 pm on Jul 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14905
votes: 649


By no means a new UA, but with a twist.

Old familiar version:
37.187.162.abc - - [09/Jul/2018:13:36:56 -0700] "GET /ebooks/shropshire/ HTTP/1.1" 403 1838 "-" "Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)"
New version:
37.187.162.abc - - [10/Jul/2018:15:49:44 -0700] "GET /ebooks/shropshire/ HTTP/1.1" 403 1838 "-" "Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/entries/20023257-what-is-paper-li)"
Awright, keyplyr, let's see how quickly you can spot the difference.

I spent several minutes staring at my processed robot logs trying to figure out why it retained the UA string when I've coded it to omit anything that's always the same.
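(A robust way around this class of change is to match only the stable part of the UA. A minimal sketch, assuming you only need the bot token and version rather than the full string; the test strings below are the UAs logged above.)

```javascript
// Match PaperLiBot by its token and version only, so a change elsewhere
// in the UA (such as http -> https in the bot-info URL) doesn't break the rule.
const paperLiBot = /\bPaperLiBot\/[\d.]+/;

const oldUA = 'Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)';
const newUA = 'Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/entries/20023257-what-is-paper-li)';

console.log(paperLiBot.test(oldUA)); // true
console.log(paperLiBot.test(newUA)); // true
```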
10:47 pm on July 12, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12083
votes: 770


They changed the URL to secure protocol (don't compare timestamps... I was in the shower)

I get a trickle of traffic from people who have posted my site & citation at Paper.li, but hardly enough to justify the activity they sometimes create at my server. If you're big on Social Media, you'll see a lot of activity from this UA, as their users repost from other SM sites.

Related: [webmasterworld.com...]
11:34 pm on July 12, 2018 (gmt 0)

lucy24 (Senior Member from US)


As soon as I pinpointed the change, I thought of you :)
11:42 pm on July 12, 2018 (gmt 0)

keyplyr (Moderator This Forum from US)


How sweet... then you immediately thought of how best to torment me by asking me to go through the very thing that had you puzzled.
1:15 am on July 13, 2018 (gmt 0)

lucy24 (Senior Member from US)


Well, I was thinking it illustrates one of your favorite points: even unwanted (in my case) robots are going https :)

:: detour to check something ::

a content curation service that let's you turn socially shared content
Nope, still haven't fixed the grocer's apostrophe.
3:18 am on July 13, 2018 (gmt 0)

keyplyr (Moderator This Forum from US)


Did you discover the UA change because an existing rule didn't work with the new UA? Do you use the entire UA string in your rules, e.g. the bot info URL?
4:00 am on July 13, 2018 (gmt 0)

lucy24 (Senior Member from US)


My preliminary log-wrangling pulls out a long list of known robots that I currently keep track of, and then each one gets dumped into its own logs. In general, I only track the parts that change: for example, if a bot always uses the identical IP, it's noted once and for all rather than each time; if it's always the identical UA string, it isn't recorded each time; I don't care about the method unless the same robot does both HEAD and GET; and so on.

I discovered the UA change because it happened in the middle of a logging period--currently 2x weekly--literally between one day and the next. If instead it had happened to change between one logging period and the next, each set would have matched internally and I might never have noticed. (This has been known to happen. Then I have to go back through saved logs and pinpoint when it changed from Old UA to New UA, or from Old IP to New IP.)

For this specific robot, I expected to get an output listing
--IP address (this robot uses a couple of different ones, so I tell it to keep track even if they're all the same in a given week)
--requested page
--status of request (I've found it helps me to note this each time, even if some robot is always 403 or always 302)
So it was confusing to also get a list of UAs; at first I thought I'd made a mistake when cutting-and-pasting. It's triggered by a function that compares all values for a given category to see if they're identical.

Obviously the computer is better than the human brain at detecting that the string "http" != the string "https".
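(The "omit anything that's always the same" idea can be sketched roughly like this — my function and field names, not lucy24's actual code, and the sample entries are invented for illustration.)

```javascript
// For each tracked field, report the full list of values only if they
// vary within the logging period; a field that never changes is
// collapsed to a single note. A field that unexpectedly shows up in the
// output is therefore a field whose value changed mid-period.
function collapseConstantFields(entries, fields) {
  const report = {};
  for (const field of fields) {
    const values = entries.map(e => e[field]);
    const unique = [...new Set(values)];
    report[field] = unique.length === 1 ? unique[0] : values;
  }
  return report;
}

const entries = [
  { ip: '37.187.162.1', status: '403', ua: 'PaperLiBot/2.1 (http)' },
  { ip: '37.187.162.2', status: '403', ua: 'PaperLiBot/2.1 (https)' },
];
const report = collapseConstantFields(entries, ['ip', 'status', 'ua']);
// report.status collapses to '403'; report.ip and report.ua stay as
// full lists because their values varied within the period.
```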
4:14 am on July 13, 2018 (gmt 0)

keyplyr (Moderator This Forum from US)


I just have a list of whatever I'm currently keeping track of in a bat file with grep commands.

If I don't see a return on what I think should be there, I'll manually shorten the value by one character until I do get a return (or not). That sometimes uncovers changes.
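(That shorten-until-it-matches technique can be sketched as follows — a hypothetical helper, not keyplyr's actual bat/grep setup. The point where the match stops is usually exactly where the string changed.)

```javascript
// Return the longest prefix of `needle` found in `logText`.
// Shortening one character at a time mirrors the manual grep approach:
// where the matching prefix ends is where the logged string diverged.
function longestMatchingPrefix(logText, needle) {
  for (let len = needle.length; len > 0; len--) {
    const prefix = needle.slice(0, len);
    if (logText.includes(prefix)) return prefix;
  }
  return null; // not even the first character appears
}

const log = '... "Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/..." ...';
const oldNeedle = 'PaperLiBot/2.1; http://';
console.log(longestMatchingPrefix(log, oldNeedle));
// → 'PaperLiBot/2.1; http' — the match stops right where http became https
```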

I also manually look through the error log and at every single 403 (and sometimes 50*) in the server log.

All in all, a 100MB log file usually takes about an hour to look through and look up info.

I'll do this twice a day usually.
6:07 pm on July 13, 2018 (gmt 0)

lucy24 (Senior Member from US)


If my logs were 100MB I would not be able to process them in JavaScript :) Even one meg is unusual (as in the blizzard of Barsetshire mobile-facebook referers I was talking about in another thread).

In the specific case of PaperLiBot, that's the string I search for. It's exceedingly rare to have to search for an exact, beginning-to-end, literatim UA string from the first Mozilla to the last /about-our-robot link.