wwwster

wwwster/1.2 (Beta, mailto:gue[at]cis.uni-muenchen.de)

bull

4:09 pm on Aug 12, 2004 (gmt 0)

The bot originates from 129.187.254.* (data centre for Munich education institutes). Until now, it has only pulled robots.txt on a poorly linked project.
According to the professor behind the e-mail address, a new German search engine will be the result. Further details were not provided. A previous version did not include any contact info. I kindly suggested providing an info page in the future instead of an e-mail address.
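
For anyone who wants to see how a robots.txt rule would apply to this user agent, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt rules and the example.com URLs below are hypothetical and are not taken from any site mentioned in this thread; only the user-agent string comes from the post above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out wwwster entirely and keeps
# every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: wwwster
Disallow: /

User-agent: *
Disallow: /private/
"""

# User-agent string as reported at the top of the thread.
WWWSTER_UA = "wwwster/1.2 (Beta, mailto:gue[at]cis.uni-muenchen.de)"


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for ua in (WWWSTER_UA, "SomeOtherBot/1.0"):
        print(ua, is_allowed(ROBOTS_TXT, ua, "http://example.com/index.html"))
```

The parser matches on the product token before the first "/" in the user agent, so a rule for "wwwster" should cover later version numbers as well. Whether the bot actually honours such a rule is a separate question that only the logs can answer.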

wilderness

9:13 pm on Aug 12, 2004 (gmt 0)

hey bull,
Any idea whether there is a way on the RIPE page to get a net breakdown of "129.187."?

A RIPE search shows the entire 129.187.0-255.0-255 range assigned to one provider.

Thanks in advance

Don

bcolflesh

9:18 pm on Aug 12, 2004 (gmt 0)

This spider would fall in:

LRZ-MUNICH-NET
129.187.0.0 - 129.187.255.255

bull

6:51 am on Aug 13, 2004 (gmt 0)

Don,
bcolflesh is correct, though this spider will come from 129.187.254.*, where all of LRZ's proxies etc. are located. I use their services too, so I know.
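
A small sketch of how the two ranges mentioned so far could be checked against a log entry, using Python's standard ipaddress module. The CIDR forms are simply the ranges quoted in this thread rewritten as prefixes; the function name and the labels it returns are made up for illustration.

```python
import ipaddress

# LRZ-MUNICH-NET as quoted above: 129.187.0.0 - 129.187.255.255
LRZ_NET = ipaddress.ip_network("129.187.0.0/16")
# The proxy block bull describes: 129.187.254.*
LRZ_PROXIES = ipaddress.ip_network("129.187.254.0/24")


def classify(addr):
    """Label an IPv4 address by the narrowest thread-mentioned range it falls in."""
    ip = ipaddress.ip_address(addr)
    if ip in LRZ_PROXIES:
        return "lrz-proxy"
    if ip in LRZ_NET:
        return "lrz"
    return "other"


if __name__ == "__main__":
    for sample in ("129.187.254.17", "129.187.10.1", "4.1.2.3"):
        print(sample, classify(sample))
```

The sample addresses are arbitrary picks inside and outside the ranges, not addresses seen in anyone's logs.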

Lord Majestic

8:25 am on Aug 13, 2004 (gmt 0)

I kindly suggested providing an info page in the future instead of an e-mail address.

Perhaps he wants to keep his project secret until the right time comes. Is it not reasonable of him to identify his crawler uniquely and provide his email address?

Note - I am not him ;o)

wilderness

12:19 pm on Aug 13, 2004 (gmt 0)

bcol and bull,
I understand that the bot is within the range, many thanks.
My inquiry may best be understood by way of a comparison.

Using an ARIN search, enter the following: > 4.1.
The results will not come back as instantaneously as you may be accustomed to. When the results page appears, scroll down and look at the sub-net range registrations.

My question was whether any such search capability exists in the RIPE searches.

Don

Additionally, prior to early 2002, ARIN used the term "net" rather than the ">" character.
I have a page saved from an IP range lookup that worked under
"net 64.246." but fails under the newer syntax, "> 64.246."

As a result, I apparently have a range of subnet numbers from this particular Everyone net which are no longer public info.

[edited by: wilderness at 12:43 pm (utc) on Aug. 13, 2004]

bull

12:40 pm on Aug 13, 2004 (gmt 0)

Perhaps he wants to keep his project secret

One can have as many secret SE projects as he/she wants, but not with my data. I think there are plenty of members here who think similarly.

bcolflesh

12:45 pm on Aug 13, 2004 (gmt 0)

Sorry if I misunderstood your question -

Using an ARIN search, enter the following: > 4.1.

All I get is the Level 3 Communications info for the range: 4.0.0.0 - 4.255.255.255

What do you get returned?

Lord Majestic

12:51 pm on Aug 13, 2004 (gmt 0)

I think there are plenty of members here who think similarly.

Yeah, I noticed. I appreciate that spiders that overload servers, disregard robots.txt, etc. are not playing by the rules, but to say that you expect every visitor to your publicly available site to provide you with sufficient information on why they visited your site and how they will use the information consumed there is a bit excessive.

I noticed that a significant number of members of this board appear to be extremely edgy about any spider apart from that of Google. You seem to both fear and love that dependency on what is effectively the single dominant search engine that currently matters. I think it is in the strategic interest of everyone apart from the monopoly to ensure that future challengers are supported, even if it's only in the form of not banning them from your website.

I am not trying to tell you how you should run your website - I am merely saying that the overly zealous position some people take here is unnecessary overkill. It's as if you did not want anyone else to find uses for your data similar to those you allow Google.

[edited by: Lord_Majestic at 12:54 pm (utc) on Aug. 13, 2004]

wilderness

12:51 pm on Aug 13, 2004 (gmt 0)

I think there are plenty of members here who think similarly.

IMO a bot which doesn't have a clear definition of the intended use for the data it is gathering is NO DIFFERENT from a bot crawling unidentified.

As always, each webmaster must make their own determination of what is beneficial or detrimental to their website.

Many folks tolerate and accept crawls that I do not and, although I realize it is beyond comprehension, I also accept some crawls that others may not :)

wilderness

1:03 pm on Aug 13, 2004 (gmt 0)

I noticed that a significant number of members of this board appear to be extremely edgy about any spider apart from that of Google.

You're making quite an assumption there.

Might you provide a Webmaster World Spider ID link to support that?

Nobody here, at least that I'm able to recall, has made any statement about allowing Google (or any other solitary bot) and denying all others!

You'd rather have those repetitious DMOZ clones first duplicating the DMOZ data and then, with a new bot, claiming they are part of the DMOZ partnership?

The desire here is to convey and discuss bots traversing the www.

Whatever choice each webmaster makes as to benefit and detriment is their own. I personally have stated time and again that my sites are quite unlike most other sites because of their limited share. That limited share has proved of benefit in assisting this group to identify new bot crawls.

Lord Majestic

1:14 pm on Aug 13, 2004 (gmt 0)

Nobody here, at least that I'm able to recall, has made any statement about allowing Google (or any other solitary bot) and denying all others!

Well, based on messages posted in this category (Search Engine Spider Identification), it appears that the reaction to a new crawler is more likely to be "get IPs to ban it" rather than "wow, some guy might find a good new use for the data and become Google 2 one day".

Say the original starter of this thread was unhappy only because a poor guy from an educational institution in Germany included his email rather than a link to a page. He pulled robots.txt, so clearly he intends to respect it - just give him a break!

Come on people - you don't mind Netcraft pinging your servers and checking the versions of your web server to later profit from this information derived from your data? What is the proportion of bandwidth "stolen" by those spiders anyway - 1%? Or 0.1%?

People who embark on crawling the internet (and I don't mean people who just spider a whole site to republish it elsewhere) are working on moving the Net forward. Give them a break - their task is tough enough.

[edited by: Lord_Majestic at 1:21 pm (utc) on Aug. 13, 2004]

wilderness

1:20 pm on Aug 13, 2004 (gmt 0)

What do you get returned?

bcol,
I get a 917kb html page with all the sub-nets.
Converted to a text file, that is over 3850 entries and/or lines of data.

I made multiple attempts to sticky the text file to you, and whether it was my machine that failed or the sticky system, I'm not sure. In any event, I couldn't paste what I had copied.

Are you using the ARIN page
( [arin.net...] ) or a clone?

bcolflesh

1:24 pm on Aug 13, 2004 (gmt 0)

I've been using:

[arin.net...]

I tried in a couple of browsers, but I still don't get the results you receive - I'll try through some proxies later - some kind of regional targeting?

wilderness

1:41 pm on Aug 13, 2004 (gmt 0)

Well, based on messages posted in this category (Search Engine Spider Identification), it appears that the reaction to a new crawler is more likely to be "get IPs to ban it" rather than "wow, some guy might find a good new use for the data and become Google 2 one day".

More assumptions.

Say the original starter of this thread was unhappy only because a poor guy from an educational institution in Germany included his email rather than a link to a page. He pulled robots.txt, so clearly he intends to respect it - just give him a break!

More assumptions yet?
There have been harvester bots which are gracious enough to provide email addresses in the UA, so what?

You give him a break if you desire (see my closing note).
Bull's ISP falls within the same user group (see other msg#), which means Bull resides, at the very minimum, in that portion of Europe, and Bull has decided not to allow access. Bull's decision is good enough for me to follow.
Neither Bull nor I have suggested that you or any other webmaster make the same decision,
ONLY that you consider what is beneficial or detrimental to your websites.

Come on people - you don't mind Netcraft pinging your servers and checking the versions of your web server to later profit from this information derived from your data?

And yet again, more assumptions?
How can you possibly know what each participant here is minding or pinging?
Although a few provide URLs in their profiles, many do not. MOST take the precaution of editing out any possible references which might pinpoint their own website.

What is the proportion of bandwidth "stolen" by those spiders anyway - 1%? Or 0.1%?

Who cares?
The quantity is insignificant. What matters (at least to me) is that a corrective action has been put in place to deny that visitor the otherwise easy harvesting access to which it has grown accustomed.

BTW, when this thread opened, I almost included a bit of personal humor for Bull. I have previously denied nearly ALL of RIPE's IPs, as Bull and others are aware, and as a result this bot will never see my pages unless it fakes.

wilderness

1:58 pm on Aug 13, 2004 (gmt 0)

Are you using the ARIN page
( [arin.net...] ) or a clone?

bcol,
On the page above, in the explanations, the following is supplied:

Record hierarchy:
Records in the ARIN WHOIS database have hierarchical relationship with other records. To display those related records, use the following flags:

< Displays the record related up the hierarchy. For a network, display the supernet, or parent network in detailed (full) format.
> Displays the record(s) related down the hierarchy. For a network, display the subdelegation(s), or subnets, below the network, in summary (list) format. For an organization or customer, display the resource(s) registered to that organization or customer, in summary (list) format.

Using this in every search or application can be quite discouraging. However, in some instances, and especially when an ARIN search returns double and triple subnet names, the rewards can be like the example I provided.
SBC has an even more extensive return on some of their ranges than the 4. range.
AT&T has some diverse ranges as well.
There are more, however these few come to mind quickly.
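
For those who would rather script this than use the web form, a rough sketch of sending the same flagged query over the plain WHOIS protocol (port 43, query line followed by CRLF, read until the server closes the connection). It assumes that whois.arin.net accepts the "<" and ">" flags on port 43 in the same form as the web search described above, which I have not confirmed; the function names are made up for illustration.

```python
import socket


def build_arin_query(prefix, direction=">"):
    """Build a hierarchy query such as '> 4.1.' from the flags quoted above."""
    if direction not in ("<", ">"):
        raise ValueError("direction must be '<' (up) or '>' (down)")
    return f"{direction} {prefix}"


def whois(query, server="whois.arin.net", port=43, timeout=10.0):
    """Send one WHOIS query and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # WHOIS requests are a single line terminated by CRLF.
        sock.sendall((query + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Performs a real network lookup; the reply may be large.
    print(whois(build_arin_query("4.1.")))
```

Given the size of the result described earlier in the thread, it is probably wise to write the reply to a file rather than print it.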

bcolflesh

2:09 pm on Aug 13, 2004 (gmt 0)

Aha - the actual query is "> 4.1." - I glossed over the greater than sign - thanks for enlightening me.

Lord Majestic

2:18 pm on Aug 13, 2004 (gmt 0)

Although a few provide URLs in their profiles, many do not. MOST take the precaution of editing out any possible references which might pinpoint their own website.

Is that a requirement for accessing your website?

You seem to want a degree of control over how visitors to your presumably publicly accessible site will use your content. Have you ever wished you had PDF-like permissions - to print, edit, archive, not cache, etc.?

What matters (at least to me) is that a corrective action has been put in place to deny that visitor the otherwise easy harvesting access to which it has grown accustomed.

Well, I will insist on assuming that people like you are in the minority among web publishers out there - and it's a good thing! If people treated the web like PDFs then there would be no WWW as we know it :-p

wilderness

5:01 pm on Aug 13, 2004 (gmt 0)

Well, I will insist on assuming that people like you are in the minority among web publishers out there - and it's a good thing! If people treated the web like PDFs then there would be no WWW as we know it :-p

I refuse to call anybody Lord, especially when that is a title used today in your homeland.

Majestic,
Unfortunately you'd rather spout very subtle insults than attempt to understand something you have no knowledge of :(

Quite a few participants in this thread and very little from me:
[webmasterworld.com...]

What to do with names and I.P. addresses of spiders?
[webmasterworld.com...]

There are many more threads at Webmaster World which offer expanded explanations of the possibilities of data mining.
Apparently the existence of the archives is not a useful tool.

This forum was down for some months after reaching excellent growth. Perhaps Search Engine Spider ID will never be what it once was? Perhaps it will?
In either case, the past threads are still maintained in the archives, thanks to Brett.

Lord Majestic

5:10 pm on Aug 13, 2004 (gmt 0)

I refuse to call anybody Lord, especially when that is a title used today in your homeland.

It's okay - you seem to have a strong character and I would not dare to insist on anything :)

Unfortunately you'd rather spout very subtle insults than attempt to understand something you have no knowledge of

Well, you sure are no stranger to making assumptions :) Apart from a dry sense of humour, I do not intend to insult anyone. I hope you are not confusing insults with mere disagreement.

Anyhow, I do understand your position on the matter; however, I do not accept it, as I think it's alien to the principles of the Net that made it possible for the WWW to be what we know.

Perhaps some might feel better (in control?) by banning some guy from an educational institution who obeys general standards (robots/no excessive load).

Sometimes it is not known what a project will yield until some time into it - sometimes the end results are completely different from what was originally intended (I think that's how Viagra was created, not that I am into that thing!). As a result it might be hard to say for sure what will happen with the data collected.

I hope the research that this guy was probably doing will not be affected by this unInternet behaviour, which I think is shown by a very small minority of webmasters. If the majority had done the same thing in Google's early days, then I guess Google would not have become what it is now.

That said, it's your site and you can do whatever you want with it. Anyway, I think I have said enough on the topic, so let's just agree to disagree.

-Majestic

bull

6:05 pm on Aug 13, 2004 (gmt 0)

Nowhere did I say I blocked the bot I described in this thread. I started this thread to provide the information I gathered.
To repeat: I e-mailed the "poor guy", as described by Majestic here, to get the information usually provided on a robot info page. I then decided to let the bot crawl.
There is no major search engine bot without such a page. Even my own tiny bot, sucking an Italian newspaper site every two days, has one.

In my experience, educational institutions are among the most aggressive crawlers, precisely because they are educational institutions, and they also use fake UAs.

The discussion (read: thread hijacking) Majestic started here is redundant anyway. There are plenty of threads on the same topic, all from last year or older. I think the thread can now be closed.

Lord Majestic

6:11 pm on Aug 13, 2004 (gmt 0)

There are plenty of threads with the same topic, all from last year or older.

Yes, and many of them have the same theme, with few or no words of support for people who code bots. You did not ban him, but you were clearly displeased at the fact that he provided an email address instead of a URL to a page.

Sorry for "hijacking" the thread, but I felt it was important to make that point in defence of non-malicious bots. I now consider it made and will shut up.
