NextopiaBOT

Did not check robots.txt

         

bose

5:29 pm on Feb 25, 2006 (gmt 0)

Got a visit this morning from:
"NextopiaBOT (+http://www.nextopia.com) distributed crawler client beta v1.1"

It did not ask for robots.txt, and just helped itself to the home page. Their site provides no info on what this bot is or what they intend to do with the stuff it brings home.
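
For anyone who wants to check their own logs for the same behavior, here is a rough Python sketch that lists user agents which requested pages but never fetched /robots.txt. It assumes an Apache-style combined log format; the function name `agents_skipping_robots` and the regex are my own illustration, not anything Nextopia or this forum provides, so adjust it to your server's log layout.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def agents_skipping_robots(lines):
    """Return a sorted list of user agents that never requested /robots.txt."""
    paths_by_agent = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        # Drop any query string so "/robots.txt?x=1" still counts.
        path = match.group("path").split("?", 1)[0]
        paths_by_agent[match.group("agent")].add(path)
    return sorted(
        agent
        for agent, paths in paths_by_agent.items()
        if "/robots.txt" not in paths
    )
```

Feed it the lines of an access log (for example, an open file object) and it returns the agents worth a closer look. A bot that fetched robots.txt outside the time window covered by the log will also show up, so treat the result as a hint rather than proof.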

Pfui

1:50 am on Feb 26, 2006 (gmt 0)

Wow. That bot's been around a while in some way, shape or form. Check out this circa-1999 Web Robots Pages [robotstxt.org] info about the same company's pjspider / PJspider [robotstxt.org] bot.

A quick whois-dot-sc check shows the same Toronto, Ontario-based company is still portaljuice.com and nextopia.com.

FWIW, I have no data on pjspider, but I last saw NextopiaBOT in Jan., Feb., and April of 2004 -- and even then, no robots.txt calls -- running out of similar "toronto-hse-pppXXXXXXX.sympatico.ca" addresses, with this UA:

"NextopiaBOT (+http://www.nextopia.com) distributed crawler client beta v0.8"

Slow ramp-up, eh? :)

No clue where the data goes, or if it's analyzed or sold -- or both, because the company offers a range of search-related products and services.

Regardless, since respecting robots.txt doesn't appear to be part of their repertoire, I block their bots.
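
The post does not say how the blocking is done, so the following is only one possible approach, sketched in Python as a WSGI wrapper that answers matching user agents with 403. The token list, `is_blocked`, and `block_bots` are hypothetical names chosen for illustration; many sites would do the equivalent in their web server configuration instead.

```python
# Lowercase substrings to match against the User-Agent header.
# Assumption: these two tokens cover the bots named in this thread.
BLOCKED_AGENT_TOKENS = ("nextopiabot", "pjspider")


def is_blocked(user_agent):
    """Return True if the user agent contains any blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so blocked agents get 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

Matching on the user agent only stops bots that identify themselves honestly. Since the visits reported here came from dial-up style addresses, blocking by IP range would be a separate decision.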

wilderness

4:23 am on Feb 26, 2006 (gmt 0)

"Their site provides no info on what this bot is"

Is this explanation not enough?
[nextopia.com...]

bose

5:13 am on Feb 26, 2006 (gmt 0)

That page is mostly brochureware, marketing hype. No mention of whether or not they honor robots.txt or exclusion standards, etc. No clear statement of what they plan on doing with the data so collected...

On the other hand, that page has enough info to make me want to keep it off my playground. :)

The fact that it went for content without even looking for a robots.txt file certainly speaks volumes, though.