schluetersche

German bot

         

Dijkgraaf

11:46 pm on Aug 23, 2010 (gmt 0)

UA: Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.schluetersche.de)
Has also been seen with: Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.offis.de)

robots.txt: Yes
IP: 217.91.71.NNN

Requested 1000 pages in just over 2 hours; 3 of those were malformed URLs, possibly exploit attempts.

Pfui

2:55 am on Aug 24, 2010 (gmt 0)

Do you allow the legit pages it hit in robots.txt? Or did it ask for robots.txt but ignore it? (Heritrix variants, like Nutch, can be programmed to override robots.txt.)
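One way to check this from the access log is to run the requested paths back through a robots.txt parser. This is only a rough sketch using Python's standard urllib.robotparser; the function name, the sample rules, and the sample paths are made up for illustration and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

def disallowed_hits(robots_lines, user_agent, requested_paths):
    """Return the requested paths that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory, no network fetch
    return [p for p in requested_paths if not parser.can_fetch(user_agent, p)]

# Hypothetical rules and request paths, for illustration only.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
requested = ["/index.html", "/private/report.html", "/post"]

hits = disallowed_hits(robots_lines, "heritrix", requested)
print(hits)
```

An empty list would suggest the bot stayed within the rules for the paths you fed in; anything listed is a request that robots.txt should have blocked.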

Dijkgraaf

4:21 am on Aug 24, 2010 (gmt 0)

It hasn't requested any pages that are disallowed in robots.txt so far, so it is well behaved in that respect.

However, with some requests it seems to be picking up nonexistent URIs from JavaScript string literals that contain a full stop.

GET /firstname.surname

GET /exisitingfolder/4/0l/21/0q/&.!=f30g03/'&4337/0G/UQ%5BV0_03/PD HTTP/1.0


And one that it picked up from a form (action="post") resulted in:

GET /post HTTP/1.0
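For anyone wondering how a crawler could end up with requests like these, here is a rough sketch of the kind of speculative link extraction that would produce them. This is not Heritrix's actual extractor; the regex, the function name, and the example.com base URL are all assumptions for illustration. The idea is that any quoted string containing a dot or a slash gets treated as a candidate link, and a form action value gets resolved as a relative URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical heuristic: a quoted string with no whitespace that contains
# a dot or a slash is treated as a candidate link.
CANDIDATE = re.compile(r"""["']([^"'\s]*[./][^"'\s]*)["']""")

def speculative_links(base_url, script_text):
    """Resolve every link-like string literal against the page's base URL."""
    return [urljoin(base_url, m) for m in CANDIDATE.findall(script_text)]

base = "http://www.example.com/"  # placeholder site
script = 'var user = "firstname.surname"; var n = "plain";'

js_links = speculative_links(base, script)
form_link = urljoin(base, "post")  # action="post" resolved as a relative URL

print(js_links)
print(form_link)
```

Under those assumptions, the string literal with a full stop turns into a request for /firstname.surname, and the form's action value turns into a request for /post, which matches the pattern in the log lines above.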