Crawler from an Ask IP requested sitemap.xml

No referrer and no identity in UA

Mokita

11:20 pm on Nov 17, 2006 (gmt 0)

As it happens, this particular site does have a (Google) sitemap.xml, but it also has a mod_rewrite rule that blocks all Java crawlers except Google's, so the request got a 403.

If Ask wish to utilise Sitemaps, surely they should do so openly, not by stealth?

65.119.214.9 - - [18/Nov/2006:09:30:19 +1100] "GET /sitemap.xml HTTP/1.1" 403 - "-" "Java/1.5.0_07"

Anyone else seen it or have an opinion about this?
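
For anyone who wants to hunt for similar hits in their own logs, here is a minimal Python sketch that parses the log line quoted above and flags requests with no referrer and a generic library user agent. It assumes the Apache "combined" log format; the names `LOG_PATTERN`, `GENERIC_AGENT_PREFIXES`, `parse_line` and `is_anonymous_request` are hypothetical, and the prefix list is only an illustrative guess at what counts as "generic".

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of library default user agents that name no operator.
GENERIC_AGENT_PREFIXES = ("Java/", "Python-urllib/", "libwww-perl/")


def parse_line(line):
    """Return the named fields of one combined-format log line, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_anonymous_request(entry):
    """True when the request has no referrer and a generic library UA."""
    no_referrer = entry["referer"] in ("", "-")
    return no_referrer and entry["agent"].startswith(GENERIC_AGENT_PREFIXES)


line = ('65.119.214.9 - - [18/Nov/2006:09:30:19 +1100] '
        '"GET /sitemap.xml HTTP/1.1" 403 - "-" "Java/1.5.0_07"')
entry = parse_line(line)
print(entry["host"], entry["path"], entry["status"], is_anonymous_request(entry))
```

Running the same two functions over every line of an access log, and keeping only the entries whose path is /sitemap.xml, would show whether other unidentified clients are probing for the file.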

incrediBILL

9:01 pm on Nov 18, 2006 (gmt 0)

I think you're overreacting. Ask is probably just running a prototype to catch up with the recent industry-wide sitemap adoption [webmasterworld.com], so this is probably nothing to be concerned about.

Mokita

11:00 pm on Nov 18, 2006 (gmt 0)

As far as I know, Google and Yahoo only request sitemap.xml once they have been invited to do so by the site owner submitting it. And when they do access it, they use a readily identifiable UA.

My gripe is that I didn't submit it to Ask, plus they are hiding behind a generic UA.

incrediBILL

1:41 am on Nov 19, 2006 (gmt 0)

Do you have Ask allowed in your robots.txt?

If so, I think they did nothing wrong.

If not, I'll bring the lynch mob, you supply the beer.
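
The robots.txt question can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the `is_allowed` helper and the `EXAMPLE_ROBOTS` text are hypothetical, and the token `Teoma` is assumed here to be the one Ask's crawler honours, so confirm it against Ask's own documentation before relying on it.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical robots.txt that shuts out one crawler entirely
# and keeps everyone else out of /private/ only.
EXAMPLE_ROBOTS = """\
User-agent: Teoma
Disallow: /

User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "Teoma", "/sitemap.xml"))
print(is_allowed(EXAMPLE_ROBOTS, "SomeOtherBot", "/sitemap.xml"))
```

Note that this only tells you what a well-behaved crawler would be permitted to do; it says nothing about a client that never requests robots.txt in the first place.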

wilderness

1:55 am on Nov 19, 2006 (gmt 0)

"If not, I'll bring the lynch mob, you supply the beer."

One of you will also need to add in transportation fees for the mob between North America and Australia, or vice versa ;)
I hear those Aussies enjoy their brew (ale), so Bill may not be getting off so cheap ;)

volatilegx

3:32 pm on Nov 19, 2006 (gmt 0)

I think that if a search engine company such as Ask is spidering (for whatever reason), they ought to identify themselves in the user agent. There is no reason why they couldn't, even in a prototype crawler.

thetrasher

4:35 pm on Nov 19, 2006 (gmt 0)

Cloaked robot from Ask.com: [webmasterworld.com]

ext9.eds.jeeves.ask.info (no A record) requests my default page every week with
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)".
This bot doesn't read robots.txt.

Maybe cloaking detection?
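
The "no A record" observation is what a forward-confirmed reverse DNS check would catch: the IP resolves to a hostname, but the hostname does not resolve back to the IP. A minimal Python sketch follows, assuming the standard `socket` resolver calls; `forward_confirmed` and the stub resolvers are hypothetical names, and the stubs use documentation-range addresses so the example runs without touching the network.

```python
import socket


def forward_confirmed(ip,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return (hostname, confirmed) for an IP address.

    confirmed is True only when the PTR hostname resolves back to ip.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        # No PTR record at all.
        return None, False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        # PTR exists but the hostname has no A record.
        return hostname, False
    return hostname, ip in addresses


# Stub resolvers (illustrative only) that mimic a PTR with no matching A record.
def stub_reverse(ip):
    return ("crawler.example.net", [], [ip])


def stub_forward_missing(hostname):
    raise socket.gaierror("no A record")


print(forward_confirmed("192.0.2.10", stub_reverse, stub_forward_missing))
```

Calling `forward_confirmed(ip)` with the default arguments performs real DNS lookups, so the result depends on the resolver and on whatever records exist at the time.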