
Forum Moderators: Ocean10000 & keyplyr

glindahl-cocrawler

     
8:13 pm on Jan 11, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 890



UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:48.0) Gecko/20100101 Firefox/48.0 glindahl-cocrawler/0.1.5.dev281+g8db2f00.d20180110 (+http://www.pbm.com/~lindahl/glindahl-cocrawler.html)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: svcolo.com
64.13.128.0 - 64.13.191.255
64.13.128.0/18
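
For anyone wanting to match this range in their own access rules, here is a minimal sketch using Python's standard `ipaddress` module. It only confirms that the CIDR block and the dotted range listed above describe the same addresses. The helper name `in_range` is made up for illustration and does not come from any server software.

```python
import ipaddress

# The CIDR block reported for the crawler's host.
net = ipaddress.ip_network("64.13.128.0/18")

# First and last addresses covered by the block.
first, last = net[0], net[-1]

def in_range(addr: str) -> bool:
    """Hypothetical helper: True if addr falls inside the reported block."""
    return ipaddress.ip_address(addr) in net

print(first, "-", last)
```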
9:51 pm on Jan 11, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15253
votes: 691


Good heavens. I used to know a G. Lindahl online. Wonder if it's him? :)
10:27 pm on Jan 11, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 890


Was it this guy? [pbm.com]
3:19 am on Jan 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15253
votes: 691


Hey, it is him. I looked up his profile in the venue where I used to know him--from which he seems to have departed almost as long ago as I did--and it's got a link to that selfsame pbm dot com. And a photograph. How funny. (Now, what's really funny is how long I spent poring over photographs and biographical details before realizing that the pbm was staring me in the face all along. Never mind!)

Well, just for that I've poked a preemptive hole for the robot. It came by yesterday (the 10th), asked for robots.txt, got redirected* (wrong www, I guess), asked again in the right place, and only then asked for the front page--which was denied on header grounds. A lot of robots who get redirected when requesting robots.txt then go ahead and ask for the front page--at that same wrong hostname--before they get around to following up the robots.txt redirect. Harrumph.


* On my personal site, which is https, robots.txt is exempt from redirection because some respectable robots seemed confused by the change. My “real” site has no such exemption, since it’s only a matter of with/without www and this doesn't seem to bother the robots.
5:29 pm on Jan 13, 2018 (gmt 0)

New User

5+ Year Member

joined:Nov 20, 2011
posts: 5
votes: 0


Yes, it's me. My crawler is supposed to follow up to 5 robots.txt redirects first before asking for any pages. If you send me a bug report with more details I'll be happy to look into it -- brand new crawler, I haven't gotten any feedback on it from anyone external, yet.
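
As an illustration of the behavior described above (resolve robots.txt redirects, up to a limit of 5, before requesting any page), here is a rough Python sketch. It is not the crawler's actual code. The function name `resolve_robots`, the injected `fetch` callable, and the example.com hostnames are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# Assumed response shape: (status code, Location header or None, body text).
Response = Tuple[int, Optional[str], str]
REDIRECT_CODES = {301, 302, 303, 307, 308}

def resolve_robots(
    url: str,
    fetch: Callable[[str], Response],
    max_redirects: int = 5,
) -> Tuple[Optional[str], Optional[str]]:
    """Follow robots.txt redirects; return (final_url, body).

    Returns (None, None) if more than max_redirects redirects occur.
    The caller is expected to call this before fetching any page.
    """
    # One initial request plus at most max_redirects follow-ups.
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        # Any non-redirect answer ends the chase; the caller inspects it.
        return url, body
    return None, None

# Hypothetical site that redirects the bare hostname to the www one.
site = {
    "http://example.com/robots.txt": (301, "http://www.example.com/robots.txt", ""),
    "http://www.example.com/robots.txt": (200, None, "User-agent: *\nDisallow: /private/\n"),
}
final_url, robots_body = resolve_robots("http://example.com/robots.txt", lambda u: site[u])
```

Passing `fetch` in as a parameter keeps the redirect logic testable without touching the network. A real crawler would plug in its own HTTP client there.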
6:19 pm on Jan 13, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15253
votes: 691


Heh. Nice to see that you finally got Queen Anna's New World of Words, or whatever the heck it was called, duly posted.

"My crawler is supposed to follow up to 5 robots.txt redirects first before asking for any pages."
Ooh. Goooood robot
:: patting robot approvingly on the head ::
7:15 pm on Jan 13, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 890


"I haven't gotten any feedback on it from anyone external, yet"
lucy24 is external.

So GregLindahl, what do you do with the data from our websites?
8:03 pm on Jan 13, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15253
votes: 691


Well, I can't say anything about the robot's behavior yet, because on the first visit it was blocked. I don't have one of those fancy robots.txt programs that looks at the visitor's name, checks whether it is authorized, and if not, generates a Disallow: line on the fly. So first-time visitors will not find a Disallow, except for specified directories; instead they'll be physically barred on header grounds unless and until I poke a hole.
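
For readers unfamiliar with the "fancy" kind of robots.txt program mentioned above, a minimal Python sketch of the idea might look like the following. It is not taken from any particular product. The allow-list, the directory names, and the function name `robots_txt_for` are all hypothetical.

```python
# Hypothetical allow-list of user-agent tokens that have been authorized.
ALLOWED_AGENTS = ("googlebot", "bingbot")

# Hypothetical directories that are off limits even to authorized robots.
RESTRICTED_DIRS = ("/cgi-bin/", "/private/")

def robots_txt_for(user_agent: str) -> str:
    """Build a robots.txt body tailored to the requesting user-agent."""
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_AGENTS):
        # Authorized visitors see only the specified-directory rules.
        rules = [f"Disallow: {d}" for d in RESTRICTED_DIRS]
    else:
        # Unknown visitors get a blanket Disallow generated on the fly.
        rules = ["Disallow: /"]
    return "\n".join(["User-agent: *", *rules]) + "\n"

print(robots_txt_for("glindahl-cocrawler/0.1.5"))
```

A server would call something like this from whatever handler answers requests for /robots.txt, passing in the User-Agent request header.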

Then again I may not be the most useful feedback-supplier, since I generally don't much care what a robot plans to do with the information it finds, unless it's a blatant top-to-bottom full-site scraper.
7:00 am on Jan 16, 2018 (gmt 0)

New User

5+ Year Member

joined:Nov 20, 2011
posts: 5
votes: 0


I'm building a search engine.
7:08 am on Jan 16, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 890


Wow... good luck with that. Let us know when you have an index up.
 
