
Taiga web spider

         

wilderness

3:42 am on Jan 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



128.148.31.zzz - - [18/Jan/2007:12:14:24 -0800] "GET /robots.txt HTTP/1.1" 200 4417 "-" "Taiga web spider"

robots.
Root.
Search help page.

incrediBILL

1:01 am on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I spotted this thing back in November and the IP address is for some research project at Brown University.

OrgName: Brown University
OrgID: BROWNU
Address: 115 Waterman Street
City: Providence
StateProv: RI
PostalCode: 02912
Country: US
NetRange: 128.148.0.0 - 128.148.255.255
CIDR: 128.148.0.0/16

Probably not worth letting it crawl unless it graduates ;)

Here's something using the name Taiga:
"Taiga: Internet-scale computing"
[cs.brown.edu...]

Another candidate for your crawler might be this:
"Stochastic Models for Web Agents and the Web Environment"
[cs.brown.edu...]
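For anyone who wants to match log entries against the range in that whois record, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_brown_range` and the sample addresses are placeholders made up for illustration; the log above redacts the last octet, so the real address is not known.

```python
import ipaddress

# Range taken from the whois record quoted above.
BROWN_NET = ipaddress.ip_network("128.148.0.0/16")


def in_brown_range(addr: str) -> bool:
    """Return True if addr falls inside 128.148.0.0/16."""
    return ipaddress.ip_address(addr) in BROWN_NET


if __name__ == "__main__":
    # Placeholder addresses, not taken from any real log entry.
    for sample in ("128.148.31.1", "128.149.0.1"):
        print(sample, in_brown_range(sample))
```

The same check works for any other CIDR block by swapping the network string.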

wilderness

1:25 am on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I spotted this thing back in November and the IP address is for some research project at Brown University.

You are to be commended then for taking the time and effort to share it with the participants of Forum 11 as well as remaining "proactive" ;)

[google.com...]

[google.com...]

[google.com...]

incrediBILL

2:08 am on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do I detect an amount of discernible sarcasm in that snide remark? :)

Let me put it this way, ever since I wrote my own web site bot blocking software, I find so many new bots and crawlers that I could spend all day posting and/or blogging about them and never be finished.

Therefore, I just post and/or comment about the ones that amuse me or have something particularly interesting to note. For instance, today I've already had over 300 bots that are knocking on my door asking for things they will never get, many new IPs never seen before, and this is a daily event.

Luckily for you, I'm proactively archiving this information into the ultimate bot blocking quarantine list!

Ever heard of "O#*$!earch/1.x (www.o#*$!earch.com)"?
The WebmasterWorld software scrambled it; it's OPEN, then I, then SEARCH.

How about "BilgiBot/1.0(beta) (http://www.bilgi.com/; bilgi at bilgi dot com)"?

Or maybe "ICC-Crawler(Mozilla-compatible; [kc.nict.go.jp...] icc-crawl@ml.nict.go.jp)"?

and on and on and on...

See, that's the difference with whitelisting, they are all blocked by default.

[edited by: incrediBILL at 2:12 am (utc) on Jan. 24, 2007]

wilderness

2:19 am on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do I detect an amount of discernible sarcasm in that snide remark?

You mean the indirect word "jaded" confuses you ;)

Luckily for you, I'm proactively archiving this information into the ultimate bot blocking quarantine list!

And that will likely appear here as soon as the active search engine for the former bot participant of forum 11 that has been crawling pages for more than three years ;)

BTW, your list will be about useless to me.

The majority of bots and/or new bots appear from RIPE or APNIC ranges, which do not get into my websites.
Hell! I don't even take note of bots from those IP ranges.

incrediBILL

2:53 pm on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The majority of bots and/or new bots appear from RIPE or APNIC ranges, which do not get into my websites.
Hell! I don't even take note of bots from those IP ranges.

I think your definition of a bot is too narrow, as most things crawling these days rarely have an identifiable user agent, and they aren't from Asia either.

wilderness

5:45 pm on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



These threads might be a bit more interesting if some others injected their thoughts and experiences?

Or perhaps it's more amusing watching (reading) Bill and me throwing stones ;)

thetrasher

7:06 pm on Jan 24, 2007 (gmt 0)

10+ Year Member



This thread goes OT. Here are my stones.

Forum 11 is about "Search Engine Spider Identification". Reporting all non-browser actions on the net is not the topic here. Too many bots are on the net. Sometimes when I examine log files, I think I'm the only man in a bot world. Posting all bot occurrences would lead to chaos, hence it would not be helpful.

Blacklisting is a Sisyphean task. Continuously new bots emerge. Forum 11 provides a little help in this eternal fight, but don't expect too much.

If you want to block nearly all bad bots, you need whitelisting. Open internet must be closed. Is your front-door always open?
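The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the token lists and function names are made up for the example, matching on the user-agent string alone is easy to fool, and how ordinary browsers get admitted is left out entirely.

```python
# Hypothetical allowlist: user-agent tokens the site owner chooses to admit.
ALLOWED_AGENT_TOKENS = ("Googlebot", "Slurp", "msnbot")

# Hypothetical blocklist, shown for contrast.
BLOCKED_AGENT_TOKENS = ("Taiga web spider",)


def blacklist_allows(user_agent: str) -> bool:
    """Default allow: only agents already on the blocklist are refused."""
    ua = user_agent.lower()
    return not any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


def whitelist_allows(user_agent: str) -> bool:
    """Default deny: only agents on the allowlist are admitted."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in ALLOWED_AGENT_TOKENS)


if __name__ == "__main__":
    # A bot nobody has listed yet slips past the blacklist but not the whitelist.
    new_bot = "BilgiBot/1.0(beta)"
    print(blacklist_allows(new_bot), whitelist_allows(new_bot))
```

With a blacklist, every newly seen bot needs a new entry; with a whitelist, the list only changes when the owner decides to admit something.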

wilderness

7:46 pm on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks trasher.

Forum 11 is about "Search Engine Spider Identification". Reporting all non-browser actions on the net is not the topic here.

For the most part I agree.

Problem is that we have numerous bots and/or crawls starting out unidentified, with patterns that have proven in the past to be the beginnings of a yet-to-be-named bot or SE.
(One example is the recently reactivated Amazon thread, which began crawling anonymously and then switched to a Java UA and recently two bot names.)

Don

Mokita

10:51 pm on Jan 24, 2007 (gmt 0)

10+ Year Member



Forum 11 is about "Search Engine Spider Identification". Reporting all non-browser actions on the net is not the topic here.

In addition to the valid point that Don made, I think Forum 11 would die a slow death from boredom and disuse if it was confined to SE spiders only.

Obviously all major and minor SE spiders have already been identified and only the occasional new one pops up. The activity created by threads on SE spiders alone, would not justify being a separate forum.

Maybe this forum should be renamed "Bot, Spider and Crawler Identification" to match the topics currently being posted. There is no other forum suitable to move to.

incrediBILL

6:48 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If your definition of a spider is only something you can see with an identifiable user agent then pack up and close shop now because I can think of a bunch of corporate spiders that don't want to be seen such as Picscout, Cyveillance, NetSweeper, etc. which can only be detected by their activity and nothing else.

Therefore, without tracking things that pretend to be browsers but behave like bots you would never identify these crawlers.

It's just more challenging is all ;)
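The posts here don't say how any particular tool spots this kind of activity, so the following is only one common heuristic, sketched under assumptions: flag a client that makes more requests in a short window than a person plausibly would. The class name `RateTracker` and the thresholds are hypothetical and would need tuning against real logs.

```python
from collections import defaultdict, deque


class RateTracker:
    """Flag clients that make more than max_hits requests within window seconds."""

    def __init__(self, max_hits: int = 30, window: float = 10.0):
        self.max_hits = max_hits
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if ip now looks automated."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window:
            hits.popleft()
        return len(hits) > self.max_hits


if __name__ == "__main__":
    tracker = RateTracker(max_hits=3, window=10.0)
    # Four requests in four seconds from one placeholder address.
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, tracker.record("192.0.2.1", t))
```

The user agent plays no part in this check, which is the point: a crawler claiming to be a browser is still caught if it behaves like a crawler.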

wilderness

7:36 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If your definition of a spider is only something you can see with an identifiable user agent then pack up and close shop now because I can think of a bunch of corporate spiders that don't want to be seen such as Picscout, Cyveillance, NetSweeper, etc. which can only be detected by their activity and nothing else.

Bill,
Everybody's aware you're a busy man with massive resources dedicated to tracking and identifying crawls, spiders and UAs far beyond the capabilities of any other participant in this forum, however. . . . for the sake of cordiality, communication and understanding?

Could you possibly spare a few seconds and provide what submission this brilliant deduction was the result of?
Possibly in a quote?

Many thanks for your understanding and tolerance of everybody else's incompetence.

Don

wilderness

7:38 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, there are numerous threads of Cyveillance and NetSweeper in the Webmaster World archives.

incrediBILL

8:14 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks for your understanding and tolerance of everybody else's incompetence.

Huh?

I was just giving an example of why we should discuss some bot activity that wasn't so simple to identify as sometimes it takes more than a few clues to figure out it's Cyveillance or Picscout at work, assuming we can ever figure them out.

BTW, there are numerous threads of Cyveillance and NetSweeper in the Webmaster World archives.

Did I say there weren't?

Again, examples...

[edited by: incrediBILL at 8:15 pm (utc) on Jan. 25, 2007]

wilderness

8:21 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, there are numerous threads of Cyveillance and NetSweeper in the Webmaster World archives.

Did I say there weren't?

Again, examples...

If you need examples for your blog references?
I'd suggest searching the Webmaster World archives or Google.

If your interest is in assisting other participants in this forum?
I haven't seen anybody other than yourself (at least in this thread) inject or request information on Cyveillance and NetSweeper.

Thus you are just as capable as I am of searching for "examples".
Happy hunting.

incrediBILL

8:37 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Um, we're having a disconnect.

I didn't say I was LOOKING for examples, I was USING them as examples of the types of crawlers that don't want to be found.

Sheesh.

wilderness

8:51 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I didn't say I was LOOKING for examples, I was USING them as examples of the types of crawlers that don't want to be found.


You are to be commended for sharing those examples with other participants of SSID/Forum 11 as well.