Ask.com spider with blank user agent


Lain_se

3:01 pm on Feb 10, 2010 (gmt 0)

10+ Year Member



About once a week I get a robots.txt request from the same IP, which resolves to IAC Search Media Inc, also known as Ask.com. In my logs the user agent on that request is always blank, yet the rest of the site is crawled from the same IP with a full user agent. I block blank user agents from requesting the robots file, and it doesn't seem to affect my ranking, which is actually very high on Ask.com for my industry.

Why would this spider use a blank user agent to request robots.txt, but then send a full one while indexing?

keyplyr

9:23 pm on Feb 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Although I do block requests with a blank UA, I allow everything to request robots.txt. Some crawlers/bots/spiders use a separate utility, often with a blank UA, to request robots.txt. Some also make HEAD requests with a blank referrer.
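
Something along these lines in .htaccess does it for me (just a rough sketch, assuming Apache with mod_rewrite - adjust to your own setup):

    RewriteEngine On
    # Refuse blank user-agents everywhere except robots.txt
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F]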

dstiles

10:06 pm on Feb 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yahoo often hits pages using blank headers. I block it - doesn't seem to matter to them. Haven't seen any Ask ones at all (I only log pages, not robots.txt).

jdMorgan

11:07 pm on Feb 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do not block any user-agent from accessing your robots.txt file and do not block *any* requests for your custom 403 error page.

Doing so can effectively subject your site to unwanted spidering and a "self-inflicted denial-of-service attack." Some primitive spiders interpret anything other than a 200-OK response with a valid robots.txt file as carte blanche to spider the whole site. Blocking access to your custom 403 error page means that any denied request sets off a cascade of 403-Forbidden errors: the blocked agent is also forbidden from fetching the custom 403 error page, so it gets another 403 when the server tries to serve it, another 403 in response to that second 403, and yet another in response to that...
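
As a rough sketch (Apache with mod_rewrite assumed, and /403.html standing in for whatever your ErrorDocument actually points to), the exclusions look like this:

    ErrorDocument 403 /403.html

    RewriteEngine On
    # Deny blank user-agents, but never deny robots.txt or the 403 page itself
    RewriteCond %{REQUEST_URI} !^/(robots\.txt|403\.html)$
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F]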

Jim

Lain_se

6:52 pm on Feb 11, 2010 (gmt 0)

10+ Year Member



Thanks guys.

I had been blocking a large number of user agents (or the lack of one) due to well-documented abuse and known scraper tools. My site is located in a sub-folder, and the root directory contains my robots file along with some other things such as bot traps, my Project Honey Pot trap and my proxy detector script.

I just think it's pretty stupid of a major search engine to use a blank user-agent to request this file, but then again maybe it explains why they have so little market share in the search industry?

Now if only I could get the jerk who runs that dumb bot called DotBot to go away. :( You would think that after eating 403s for the last three months he would stop. It requests the robots file, falls into my bot trap, and then keeps eating 403s all day long, every day.