What is heritrix? Who is Darwin?

Found in User Agent. Should I block these?

         

grandma genie

12:02 am on Aug 30, 2010 (gmt 0)

10+ Year Member



Hi,
Gosh, I'm finding all kinds of new stuff today. Here is one server entry:

173.192.nn.nnn - - [29/Aug/2010:19:46:42 -0400] "GET /birds/owl.html HTTP/1.0" 200 5672 "mywebsite.com/site_directory.html" "Mozilla/5.0 (compatible; heritrix/1.14.3 +http://www.accelobot.com)"

How do I block this user agent? Should I block this user agent? Does anyone know who this is?

And the other new one is Darwin. Here is one of the entries. This one is coming in from all over the place. Hundreds of different IPs. All going to the same image.

98.231.nn.nn - - [29/Aug/2010:19:41:42 -0400] "GET a_certain_image.jpg HTTP/1.1" 200 15295 "-" "Bing/1.2.1 CFNetwork/459 Darwin/10.0.0d3"

I know that the image is on my server and it is on the Bing Images page, so I am assuming that is why so many people are hitting on that picture, but should they all have the same User Agent string?

Grandma_genie

keyplyr

8:48 pm on Aug 30, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



grandma genie - are you using the site's SEARCH utility? There is tons of information on these UAs. Reading the archived posts here at WW is how I educated myself to be an able webmaster.

dstiles

8:49 pm on Aug 30, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are hundreds of bots! A search on webmasterworld for each bot should get you results.

Heritrix is a general-purpose scraper-type bot and is normally used as such. The only legit use I've seen recently has been UK Yellow Pages, who are rubbish at bots anyway. Can't comment on Darwin.

grandma genie

10:36 pm on Aug 30, 2010 (gmt 0)

10+ Year Member



I usually check the search on Webmasterworld first, but couldn't find Darwin, so figured I'd just ask about both in one question. I tried blocking heritrix in htaccess like this:

RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]

but it didn't work. What is the correct way? I have blocked the IP, which did work, but I'd like to know how to do it using the correct user_agent.

I know there are many ways to deal with bots and scrapers, but since I am not lord of my server, I have to try to do what I can with my limited resources, which is pretty much robots.txt and htaccess. The Webmasterworld members are very helpful and so nice. I don't know what I would do without this forum. My aging brain makes it difficult, but I keep trying. I know I am dealing with teenage hackers from China, Romania, Russia, Korea, India, etc., but with God's help (and Webmasterworld), all things are possible. Thank you keyplyr and dstiles.

jdMorgan

10:46 pm on Aug 30, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Blocking "^heritrix" will fail if the actual user-agent (the one you see in your raw server log file) does not start with "heritrix". In your log entry the string starts with "Mozilla/5.0", so the pattern never matches. That is, the "^" is a start-anchor, and the pattern only matches if the string starts with what is specified. Similarly, the "$" character is an end-anchor, and a pattern ending with it will only match strings that end with the specified pattern. Include both "^" and "$" for an exact match only, and leave both out if you want a "floating match" -- to match anywhere in the string.

Also, the rule might have failed depending on what comes before and after the line you posted.
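
As a minimal sketch of the floating match described above, assuming mod_rewrite is enabled and the rule sits in the site's root .htaccess, something like this should return 403 Forbidden to any user-agent containing "heritrix". Note that the last RewriteCond before the RewriteRule carries no [OR] flag; a stray [OR] on the final condition is a common way for the surrounding lines to make a rule misbehave.

```apache
# Sketch only: assumes mod_rewrite is available and this is the root .htaccess
RewriteEngine On

# No "^" or "$": match "heritrix" anywhere in the User-Agent, ignoring case
RewriteCond %{HTTP_USER_AGENT} heritrix [NC]

# Match every request path and answer 403 Forbidden
RewriteRule ^ - [F]
```

If more user-agents are added later, every RewriteCond except the last one would take [OR].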

Darwin is usually associated with "CFNetwork" -- Apple's networking framework, used by Safari and other Apple software for fetching .ico icon files, PDF files and other types of multimedia files.

I block "heritrix" on sight, but allow "CFNetwork/<numbers.numbers> Darwin/<numbers.numbers>"
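
One possible way to express "block one, allow the other" is sketched below; it is an illustration, not necessarily how the rules on Jim's own sites are written. It assumes the allow rule is placed before any blocking rules in the same .htaccess, and it reads <numbers.numbers> loosely as "digits and dots".

```apache
RewriteEngine On

# Stop rule processing early for "CFNetwork/<digits> Darwin/<digits>"
RewriteCond %{HTTP_USER_AGENT} "CFNetwork/[0-9.]+ Darwin/[0-9.]+"
RewriteRule ^ - [L]

# Blocking rules placed after this point are not reached by those agents
RewriteCond %{HTTP_USER_AGENT} heritrix [NC]
RewriteRule ^ - [F]
```

Because the pattern floats, it also matches strings with a prefix or suffix, such as the logged "Bing/1.2.1 CFNetwork/459 Darwin/10.0.0d3".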

Jim

grandma genie

11:07 pm on Aug 30, 2010 (gmt 0)

10+ Year Member



Hi jd. I will remove the "^". I'm sure I've read somewhere about the proper way to code in htaccess, but unless I actually "do" it, I usually forget what I read. So I just have to remember when reading the user_agent string, if the agent is not the first word, leave off the anchors. On that note, do you recommend not including the anchors in general in htaccess, in order to catch all instances of these user_agents?

Thanks also for the info on Darwin.

jdMorgan

11:55 pm on Aug 30, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I recommend using the anchors whenever possible for efficiency, and omitting them if necessary due to wide/wild user-agent string variations.
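
To illustrate the trade-off, here is a hedged sketch that mixes both styles. The first pattern is anchored to the exact start of the heritrix string from the log entry earlier in the thread; the second uses "examplebot", a made-up name standing in for a bot whose position in the string varies.

```apache
RewriteEngine On

# Anchored: fails fast when the UA does not begin with this exact text
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/5\.0 \(compatible; heritrix/" [OR]

# Floating: "examplebot" is a hypothetical name that may appear anywhere
RewriteCond %{HTTP_USER_AGENT} examplebot [NC]

# Either condition being true is enough to answer 403 Forbidden
RewriteRule ^ - [F]
```

The anchored pattern stops matching if the bot ever changes the text in front of its name, which is the case where dropping the anchor becomes necessary.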

There is a good regular-expressions reference cited in our Apache Forum Charter.

Jim

keyplyr

12:30 am on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RE: CFNetwork/459 Darwin/10.0.0d3

1.) CFNetwork (all versions) is used by Apple products (Mac computers, iPhones, iPads, etc.) as an image file download client. It shows up in the UA string only when it is being utilized, so the Mac user may browse your site's files without CFNetwork being displayed in the UA string, but when that user wants to "save" one of your image files to their machine, the CFNetwork UA will appear.

CFNetwork also displays in the UA string when favicon.ico is requested, so I allow it only in that instance and block for all other files. Here is a simple example:

RewriteCond %{HTTP_USER_AGENT} CFNetwork
RewriteRule !^favicon\.ico$ - [F]


2.) Darwin (all versions) is the Unix-based core operating system underneath Mac OS X and iOS, not a separate bot. Its version is reported in the UA string when CFNetwork is being used, and AFAIK it is not a threat in itself.

jdMorgan

1:52 am on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great post key... Thanks for the details on CFNetwork and especially on Darwin.

I allow CFNetwork to fetch .ico files, plus .pdf and .xls files, as these are posted on my site for download.

I also allow it to fetch apple-touch-icon.png and apple-touch-icon-precomposed.png when used with a mobile version of the Safari browser on my mobile sites.
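
A rough sketch of what such an allow-list might look like is below. It is built only from the file types and icon names mentioned in this post, assumes the icons live in the site root, and is not a copy of the actual rules in use.

```apache
# Sketch only: adjust the list to whatever your own site offers for download
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} CFNetwork

# Let .ico, .pdf and .xls requests through
RewriteCond %{REQUEST_URI} !\.(ico|pdf|xls)$ [NC]

# Let the two Apple touch icons in the site root through
RewriteCond %{REQUEST_URI} !^/apple-touch-icon(-precomposed)?\.png$

# Everything else requested by CFNetwork gets 403 Forbidden
RewriteRule ^ - [F]
```

If the site uses a custom 403 error page, that page would need to be excluded in the same way so the error document itself can be served.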

So, as usual, each Webmaster has to decide what is appropriate for their sites, and enforce access controls accordingly. :)

Jim

keyplyr

2:06 am on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>I also allow it to fetch apple-touch-icon.png and apple-touch-icon-precomposed.png when used with a mobile version of the Safari browser on my mobile sites


Ya, I do as well, but I don't offer .pdf or .xls for DL to the public. I just tried to give a simple example above. Here is what I actually have as a rewrite rule:

RewriteCond %{HTTP_USER_AGENT} CFNetwork
RewriteRule !^(apple-touch-icon\.png|apple-touch-icon-precomposed\.png|favicon\.ico|forbidden\.html)$ - [F]

Since this agent was requesting these iPhone specific files (which I didn't have) I actually created them with a little advertisement for my site - LOL

grandma genie

3:31 pm on Aug 31, 2010 (gmt 0)

10+ Year Member



Thanks jd and key, I guess I will wait and see if my mac or mobile visitors get too greedy with my pix. So far they are not. Since many of the people or bots coming to my site are looking for pictures, if they grab too much, I block them. My biggest concern is that my competitors are trying to do something to cause me to lose my placement in the search engines. I just don't know how they would do it or how to stop them.

keyplyr

7:17 pm on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The potential hassle of image scrapers is significant. They get your images, then display them in image directories which often include the full file path, which in turn gets used by social media, blogs, forums, et al as hot-links.

Even if you control file access by referrer, there are many who block the referrer or never clear their browser's cache, so they don't know you've blocked the hot-link.
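
For readers who have not set up referrer control yet, a minimal sketch follows. It assumes mod_rewrite is available, and "example.com" is a placeholder for your own domain. Requests with no referrer are let through, which is exactly the gap described above.

```apache
RewriteEngine On

# Allow requests that send no referrer at all (direct visits, privacy tools)
RewriteCond %{HTTP_REFERER} !^$

# Allow referrers from your own site; example.com is a placeholder
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]

# Any other referrer asking for an image gets 403 Forbidden
RewriteRule \.(jpe?g|gif|png)$ - [F,NC]
```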

I have a large photo gallery and think hot-linking is a real hassle so I try to stop the pilfering before it happens.